### Prompt: 

Netflix wants to identify similar movies based on movie characteristics.

Go out and find a dataset of interest. It could be one that helps you work on one of our recommended research questions, or any other dataset that addresses an unsupervised learning question of your own.

Explore the data. Get to know the data. Spend a lot of time going over its quirks and peccadilloes. You should understand how it was gathered, what's in it, and what the variables look like.

You should try several different approaches and really work to tune a variety of models before choosing what you consider to be the best performer.

### Things to keep in mind: how do clustering and modeling compare? What are the advantages of each? Why would you want to use one over the other?

### This will ultimately include the following deliverables:

A Jupyter notebook that tells a compelling story about your data (to be submitted at the end of this checkpoint). <br>
A 15 to 30 minute presentation of your findings. You'll need to produce a deck and present it to the Thinkful community.

### Let's start by importing our data, doing some data exploration and cleaning if it's required and getting it ready for some clustering! 

In [1]:
import numpy as np
import pandas as pd
import scipy
import itertools
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans
%matplotlib inline

from ast import literal_eval
import os


In [2]:
df_movies = pd.read_csv('/Users/ir3n3br4t515/Desktop/movies_metadata.csv')
df_ratings = pd.read_csv('/Users/ir3n3br4t515/Desktop/ratings.csv')
df_links = pd.read_csv('/Users/ir3n3br4t515/Desktop/links.csv')




In [3]:
df_movies.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


### The dataset has the following features:

budget - The budget with which the movie was made.<br>
genre - The genre of the movie: Action, Comedy, Thriller, etc.<br>
homepage - A link to the homepage of the movie.<br>
id - This is in fact the movie_id, as in the first dataset.<br>
keywords - The keywords or tags related to the movie.<br>
original_language - The language in which the movie was made.<br>
original_title - The title of the movie before translation or adaptation.<br>
overview - A brief description of the movie.<br>
popularity - A numeric quantity specifying the movie popularity.<br>
production_companies - The production house of the movie.<br>
production_countries - The country in which it was produced.<br>
release_date - The date on which it was released.<br>
revenue - The worldwide revenue generated by the movie.<br>
runtime - The running time of the movie in minutes.<br>
status - "Released" or "Rumored".<br>
tagline - Movie's tagline.<br>
title - Title of the movie.<br>
vote_average - The average rating the movie received.<br>
vote_count - The number of votes the movie received.<br>

### For clustering, we will be focusing on the numeric columns of:

Budget<br>
id (Movie Id)<br>
Release Date (year)<br>
Revenue<br>
Runtime<br>
vote_average<br>
vote_count<br>
Title - Though this is not a numerical value, we will be using the Title later to do some sample checking on our data to make sure the clusters intuitively make sense! 



### Below we will get a sense of how our numeric columns look so that we can later use them for clustering.

Our genres still need to be binarized, which we will do further down. <br>
Our budget, id and release date columns still need to be converted to numbers.<br>
Let's do that now!


In [4]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
adult                    45466 non-null object
belongs_to_collection    4494 non-null object
budget                   45466 non-null object
genres                   45466 non-null object
homepage                 7782 non-null object
id                       45466 non-null object
imdb_id                  45449 non-null object
original_language        45455 non-null object
original_title           45466 non-null object
overview                 44512 non-null object
popularity               45461 non-null object
poster_path              45080 non-null object
production_companies     45463 non-null object
production_countries     45463 non-null object
release_date             45379 non-null object
revenue                  45460 non-null float64
runtime                  45203 non-null float64
spoken_languages         45460 non-null object
status                   45379 non-null objec

In [5]:
df_movies.revenue = pd.to_numeric(df_movies.revenue, errors= 'coerce')
df_movies.id = pd.to_numeric(df_movies.id, errors= 'coerce')
df_movies.budget = pd.to_numeric(df_movies.budget, errors= 'coerce')



In [6]:
df_movies['release_date'] = pd.to_datetime(df_movies['release_date'], errors = 'coerce').dt.year
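A small sketch of what `errors='coerce'` does in the two cells above, using made-up values rather than the real file: anything that cannot be parsed becomes `NaN` (or `NaT` for dates) instead of raising. That is a likely reason the non-null counts for `budget`, `id` and `release_date` each drop by 3 after the conversion. The series names and the malformed value are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for a column read as strings (object dtype).
# The last value mimics a malformed row; it is invented for illustration.
raw_budget = pd.Series(["30000000", "65000000", "0", "/some/poster.jpg"])

# Unparseable entries become NaN, which forces a float64 column
budget = pd.to_numeric(raw_budget, errors="coerce")
print(budget.dtype)
print(budget.isna().sum())

# Same idea for dates: unparseable entries become NaT, and .dt.year turns them into NaN
raw_dates = pd.Series(["1995-10-30", "1995-12-15", "not a date"])
years = pd.to_datetime(raw_dates, errors="coerce").dt.year
print(years.tolist())
```

Because the coerced columns now contain `NaN`, they come back as `float64` rather than integers.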




In [7]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
adult                    45466 non-null object
belongs_to_collection    4494 non-null object
budget                   45463 non-null float64
genres                   45466 non-null object
homepage                 7782 non-null object
id                       45463 non-null float64
imdb_id                  45449 non-null object
original_language        45455 non-null object
original_title           45466 non-null object
overview                 44512 non-null object
popularity               45461 non-null object
poster_path              45080 non-null object
production_companies     45463 non-null object
production_countries     45463 non-null object
release_date             45376 non-null float64
revenue                  45460 non-null float64
runtime                  45203 non-null float64
spoken_languages         45460 non-null object
status                   45379 non-null ob

### Checking again to make sure the columns we will model with the K-means clustering algorithm are numeric. Everything looks good (the converted columns are now float64)! Once again, our numeric columns are:
<br><br>

Budget<br>
id (Movie Id)<br>
Release Date (year)<br>
Revenue<br>
Runtime<br>
vote_average<br>
vote_count<br>

### Now the nulls.

In [8]:
df_movies.isnull().sum()


adult                        0
belongs_to_collection    40972
budget                       3
genres                       0
homepage                 37684
id                           3
imdb_id                     17
original_language           11
original_title               0
overview                   954
popularity                   5
poster_path                386
production_companies         3
production_countries         3
release_date                90
revenue                      6
runtime                    263
spoken_languages             6
status                      87
tagline                  25054
title                        6
video                        6
vote_average                 6
vote_count                   6
dtype: int64

In [9]:
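# NOTE: [[...]] returns a copy, so this dropna only cleans a temporary object and df_movies keeps its nulls
# (hence the warning and the unchanged null counts that follow); df_movies.dropna(subset=[...]) would drop the rows.
df_movies[['id', 'budget', 'vote_average', 'vote_count', 'revenue', 'runtime', 'release_date']].dropna(inplace = True)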




A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [10]:
df_movies.reset_index(drop = True, inplace = True)
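The `dropna` call on a column selection above acts on a temporary copy, which is what the SettingWithCopyWarning is pointing at. A minimal sketch of the usual alternative, on a made-up frame (the `toy` data and `cluster_cols` name are assumptions for illustration): pass the columns of interest through `subset=` so rows are dropped from the frame itself.

```python
import numpy as np
import pandas as pd

# Toy frame: nulls in the clustering columns and in an unrelated column
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [1.0, np.nan, 3.0, 4.0],
    "runtime": [90.0, 100.0, np.nan, 120.0],
    "tagline": [None, "x", None, None],
})
cluster_cols = ["budget", "runtime"]

# Dropping on a selection returns a new object; toy itself is untouched
toy[cluster_cols].dropna()
print(len(toy))

# Drop rows that are null in any clustering column, ignoring nulls elsewhere
cleaned = toy.dropna(subset=cluster_cols).reset_index(drop=True)
print(cleaned)
```

Using `subset=` keeps rows whose only missing values are in columns such as `tagline` or `homepage`, which are mostly empty in this dataset.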

In [11]:
df_movies.isnull().sum()


adult                        0
belongs_to_collection    40972
budget                       3
genres                       0
homepage                 37684
id                           3
imdb_id                     17
original_language           11
original_title               0
overview                   954
popularity                   5
poster_path                386
production_companies         3
production_countries         3
release_date                90
revenue                      6
runtime                    263
spoken_languages             6
status                      87
tagline                  25054
title                        6
video                        6
vote_average                 6
vote_count                   6
dtype: int64

### OK. Now let's move on to genres.

In [13]:
list(df_movies.genres)

["[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]",
 "[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]",
 "[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]",
 "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]",
 "[{'id': 35, 'name': 'Comedy'}]",
 "[{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}]",
 "[{'id': 35, 'name': 'Comedy'}, {'id': 10749, 'name': 'Romance'}]",
 "[{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 18, 'name': 'Drama'}, {'id': 10751, 'name': 'Family'}]",
 "[{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 53, 'name': 'Thriller'}]",
 "[{'id': 12, 'name': 'Adventure'}, {'id': 28, 'name': 'Action'}, {'id': 53, 'name': 'Thriller'}]",
 "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10

### Let's start by breaking up these long lists of genres into binary columns for each Genre

In [14]:
# run the following two lines only once
df_movies["genres"] = df_movies.genres.apply(lambda x: literal_eval(x))
df_movies["genres"] = df_movies.genres.apply(lambda x: [el["name"] for el in x])

In [15]:
unique_genres = set()
for col in df_movies.genres:
    for genr in col:
        unique_genres.add(genr)

print(unique_genres)

{'Animation', 'Telescene Film Group Productions', 'History', 'Science Fiction', 'Documentary', 'Foreign', 'Carousel Productions', 'Thriller', 'TV Movie', 'Mardock Scramble Production Committee', 'War', 'Sentai Filmworks', 'Fantasy', 'Action', 'GoHands', 'Drama', 'Rogue State', 'Crime', 'Adventure', 'Romance', 'Horror', 'The Cartel', 'Aniplex', 'BROSTA TV', 'Pulser Productions', 'Vision View Entertainment', 'Mystery', 'Odyssey Media', 'Western', 'Music', 'Family', 'Comedy'}


In [16]:
# run the following lines only once
# DataFrame.append was removed in pandas 2.0, so build each indicator column directly:
# 1 if the movie's genre list contains that name, 0 otherwise
for genre in sorted(unique_genres):
    df_movies[genre] = df_movies.genres.apply(lambda names, g=genre: int(g in names))

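As a possible alternative to filling the indicator columns row by row, here is a hedged sketch of a vectorized version on three made-up genre strings. It parses each string with `literal_eval`, keeps the genre names, joins them with `|`, and lets `Series.str.get_dummies` build the 0/1 columns. The sample strings and variable names are illustrative only.

```python
from ast import literal_eval

import pandas as pd

# Three toy rows in the same string format as the genres column
raw = pd.Series([
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[{'id': 35, 'name': 'Comedy'}]",
    "[]",
])

# Parse each string into a list of dicts, then keep only the genre names
genre_lists = raw.apply(literal_eval).apply(lambda items: [d["name"] for d in items])

# Join the names with "|" and expand them into one 0/1 column per genre
dummies = genre_lists.str.join("|").str.get_dummies(sep="|")
print(dummies)
```

The result could then be attached to the main frame with `pd.concat([...], axis=1)`. A movie with an empty genre list simply gets zeros in every column.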


### A few quick correlations to set us on a path:

In [17]:
df_movies[['budget', 'vote_average', 'vote_count', 'revenue', 'runtime', 'release_date']].corr()

Unnamed: 0,budget,vote_average,vote_count,revenue,runtime,release_date
budget,1.0,0.073494,0.676642,0.768776,0.134733,0.131675
vote_average,0.073494,1.0,0.123607,0.083868,0.158146,0.026138
vote_count,0.676642,0.123607,1.0,0.812022,0.113539,0.106789
revenue,0.768776,0.083868,0.812022,1.0,0.103917,0.088355
runtime,0.134733,0.158146,0.113539,0.103917,1.0,0.078822
release_date,0.131675,0.026138,0.106789,0.088355,0.078822,1.0


### OK! Our nulls are accounted for, our genres are binarized, and the features we are going to use are numeric! The data cleaning part is over.

### Now we are ready to start clustering and finally getting to the "Unsupervised" part of this capstone. 

### Let's create our X parameters first. 



In [18]:
# Feature matrix for clustering. Suspicious about the clustering around certain decades, but release_date is still included here.
X2 = df_movies[['budget',
'release_date',
'revenue',
'runtime',
'vote_average',
'vote_count',]]


### How many clusters are best for us? According to our Elbow Visualizer, 7!

In [19]:
df_movies.isnull().sum()


Action                                       0
Adventure                                    0
Animation                                    0
Aniplex                                      0
BROSTA TV                                    0
Carousel Productions                         0
Comedy                                       0
Crime                                        0
Documentary                                  0
Drama                                        0
Family                                       0
Fantasy                                      0
Foreign                                      0
GoHands                                      0
History                                      0
Horror                                       0
Mardock Scramble Production Committee        0
Music                                        0
Mystery                                      0
Odyssey Media                                0
Pulser Productions                           0
Rogue State  

In [20]:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

from yellowbrick.cluster import KElbowVisualizer

# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))

visualizer.fit(X2)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
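The `ValueError` above is what KMeans raises when the feature matrix still contains missing values. A minimal sketch of the same elbow idea without the visualizer, on synthetic blobs with a few injected `NaN`s: drop the incomplete rows first, then record the inertia (within-cluster sum of squares) for each candidate k. The data, the column names and the k range here are assumptions for illustration, not the movie features.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the feature matrix, with some missing values injected
points, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=0)
toy = pd.DataFrame(points, columns=["f1", "f2", "f3"])
toy.iloc[::50, 0] = np.nan

# KMeans cannot handle NaN, so remove incomplete rows before fitting
X_clean = toy.dropna()

# Fit one model per k and keep its inertia; the "elbow" is where the curve flattens
inertias = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_clean)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `inertias` against k gives the same kind of curve the elbow visualizer draws.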

Ok. Now let's also scale our data.

In [None]:
# Perform the necessary imports
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Create scaler: scaler
scaler = StandardScaler()

# Create KMeans instance: kmeans
kmeans = KMeans(n_clusters=7)
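The scaler and the KMeans instance above are created but not yet connected. One way to chain them, sketched here on random data with two features on very different scales (a budget-like column and a rating-like column, both invented), is `make_pipeline`, which standardizes the features before clustering. The choice of 3 clusters is only for the toy data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two synthetic features: one in the hundreds of millions, one between 0 and 10
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 3e8, 200), rng.uniform(0, 10, 200)])

# Standardize each feature to zero mean and unit variance, then cluster
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)
print(np.bincount(labels))
```

Without scaling, Euclidean distances would be driven almost entirely by the large-magnitude column, so the clusters would mostly reflect budget and revenue.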

### Creating our clustering model below:

In [None]:
# Import KMeans
from sklearn.cluster import KMeans

model = KMeans(n_clusters=7)

# Fit the model to our feature matrix
model.fit(X2)

# Determine the cluster label of each movie in X2: labels
labels = model.predict(X2)

# Print the cluster labels
print(labels)

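In [None]:
# Assumption: Int_Movies (never defined above) is a copy of the df_movies rows behind X2
Int_Movies = df_movies.loc[X2.index].copy()
Int_Movies['Label'] = labels
print(Int_Movies['Label'])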

### Let's redefine X2 with our labels included so that we can include it in our visualizations later and call that X3. 



In [None]:
# Same features as X2, plus the cluster label
X3 = Int_Movies[['Label', 'budget',
'release_date',
'revenue',
'runtime',
'vote_average',
'vote_count',]]


In [None]:
# Import TSNE
from sklearn.manifold import TSNE

# Create a TSNE instance: model
model = TSNE(learning_rate=200)

# Apply fit_transform to samples: tsne_features
tsne_features = model.fit_transform(X2)
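t-SNE works from pairwise distances, so it is sensitive to feature scale in the same way K-means is. A hedged sketch on random data (the array shape and the perplexity value are assumptions for the toy example): standardize the features first, then embed them in two dimensions.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 150 rows, 6 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 6))

# Put every feature on the same scale before computing distances
X_scaled = StandardScaler().fit_transform(X)

# Embed into 2 dimensions; perplexity must be smaller than the number of samples
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_scaled)
print(embedding.shape)
```

On a frame with tens of thousands of rows, t-SNE can take a long time, so fitting it on a random sample of rows may be a practical first step.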

In [None]:
# Select the 0th feature: xs
xs = tsne_features[:,0]

# Select the 1st feature: ys
ys = tsne_features[:,1]

# Scatter plot, coloring each point by its cluster label
plt.scatter(xs, ys, c = labels, cmap = 'summer')
plt.show()

In [None]:


plt.figure(figsize=(12,7))
axis = sns.barplot(x= np.arange(0,7,1),y=Int_Movies.groupby(['Label']).count()['budget'].values)
x=axis.set_xlabel("Cluster Number")
x=axis.set_ylabel("Number of movies")

In [None]:
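# Average only the numeric columns; text columns such as title cannot be averaged
Int_Movies.groupby(['Label']).mean(numeric_only = True)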


### We see that one cluster, which is also the smallest, contains the movies that received the most votes and that also have very high popularity, runtime and revenue. Let's look at some of the movies that belong to this cluster.

In [None]:
size_array = list(Int_Movies.groupby(['Label']).count()['budget'].values)
size_array


In [None]:
Int_Movies[Int_Movies['Label']==size_array.index(sorted(size_array)[0])].sample(3)


In [None]:
Int_Movies[Int_Movies['Label']==size_array.index(sorted(size_array)[1])].sample(5)



In [None]:
Int_Movies[Int_Movies['Label']==size_array.index(sorted(size_array)[-1])].sample(5)


In [None]:
Int_Movies[Int_Movies['Label']==size_array.index(sorted(size_array)[-2])].sample(5)


In [None]:
Int_Movies[Int_Movies['Label']==size_array.index(sorted(size_array)[2])].sample(5)
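The `size_array.index(sorted(size_array)[k])` lookups above return the first cluster that has a given size, so two clusters of equal size would map to the same label. A small sketch of an alternative on a made-up frame: `value_counts` keeps each size attached to its label, so the smallest and largest clusters can be read off directly. The `toy` frame is an illustrative assumption.

```python
import pandas as pd

# Toy labelled frame: cluster 0 has 4 movies, cluster 1 has 2, cluster 2 has 1
toy = pd.DataFrame({
    "title": list("ABCDEFG"),
    "Label": [0, 0, 0, 1, 1, 2, 0],
})

# Cluster sizes indexed by label, sorted from largest to smallest
sizes = toy["Label"].value_counts()
smallest = sizes.idxmin()
largest = sizes.idxmax()
print(sizes)

# Rows belonging to the smallest cluster
print(toy[toy["Label"] == smallest])
```

Iterating over `sizes.index` visits every cluster exactly once, in size order.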


In [None]:
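# Keep the scatter handle so the legend shows one entry per cluster
scatter = plt.scatter(X3['Label'], X3['budget'], c = labels, cmap = 'summer')
plt.legend(*scatter.legend_elements(), title = 'Cluster')

plt.show()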

In [None]:
# Distribution of each feature within cluster 1.
# distplot is deprecated in recent seaborn releases; histplot(kde=True) is its replacement.
# To change the bin size, pass bins=np.arange(...) or bins=np.linspace(...) for evenly spaced bins.
cluster_mask = X3['Label'] == 1
for col in ['vote_average', 'vote_count', 'runtime', 'revenue', 'release_date', 'budget']:
    sns.histplot(X3.loc[cluster_mask, col], kde=True, stat='density', color='b')
    plt.show()

In [None]:
# Distribution of each feature within cluster 0 (histplot replaces the deprecated distplot)
cluster_mask = X3['Label'] == 0
for col in ['vote_average', 'vote_count', 'runtime', 'revenue', 'release_date', 'budget']:
    sns.histplot(X3.loc[cluster_mask, col], kde=True, stat='density', color='g')
    plt.show()

In [None]:
# Distribution of each feature within cluster 3 (histplot replaces the deprecated distplot)
cluster_mask = X3['Label'] == 3
for col in ['vote_average', 'vote_count', 'runtime', 'revenue', 'release_date', 'budget']:
    sns.histplot(X3.loc[cluster_mask, col], kde=True, stat='density', color='y')
    plt.show()

In [None]:
# Distribution of each feature within cluster 4 (histplot replaces the deprecated distplot)
cluster_mask = X3['Label'] == 4
for col in ['vote_average', 'vote_count', 'runtime', 'revenue', 'release_date', 'budget']:
    sns.histplot(X3.loc[cluster_mask, col], kde=True, stat='density', color='r')
    plt.show()

In [None]:
# Distribution of each feature within cluster 5 (histplot replaces the deprecated distplot)
cluster_mask = X3['Label'] == 5
for col in ['vote_average', 'vote_count', 'runtime', 'revenue', 'release_date', 'budget']:
    sns.histplot(X3.loc[cluster_mask, col], kde=True, stat='density', color='c')
    plt.show()

In [None]:
# Distribution of each feature within cluster 6 (histplot replaces the deprecated distplot)
cluster_mask = X3['Label'] == 6
for col in ['vote_average', 'vote_count', 'runtime', 'revenue', 'release_date', 'budget']:
    sns.histplot(X3.loc[cluster_mask, col], kde=True, stat='density', color='k')
    plt.show()

In [None]:
# Distribution of each feature within cluster 2 (histplot replaces the deprecated distplot)
cluster_mask = X3['Label'] == 2
for col in ['vote_average', 'vote_count', 'runtime', 'revenue', 'release_date', 'budget']:
    sns.histplot(X3.loc[cluster_mask, col], kde=True, stat='density', color='m')
    plt.show()
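The note in the cluster 1 cells about changing the bin size can be sketched as follows, on synthetic runtimes rather than the real clusters: `np.linspace` produces evenly spaced bin edges, and passing the same edges to every histogram keeps the plots comparable across clusters. The range of 40 to 160 minutes and the 24 bins are assumptions chosen for the toy data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the runtimes of one cluster
rng = np.random.default_rng(0)
runtime = rng.normal(100, 15, 500)

# 25 evenly spaced edges give 24 bins, each 5 minutes wide
bins = np.linspace(40, 160, 25)

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(runtime, bins=bins, color="b")
ax.set_xlabel("runtime")
ax.set_ylabel("Number of movies")
plt.show()
```

Reusing one `bins` array for all clusters means differences between the histograms reflect the data rather than differing bin widths.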