# Decision Tree Lab

### Analyzing IMDb Movie Ratings

Congrats! You just graduated from UVA's BSDS program and got a job working at a movie studio in Hollywood.

Your boss is the head of the studio and wants to know if they can gain a competitive advantage by predicting which new movies might get high IMDb scores (movie ratings).

You would like to be able to explain the model to mere mortals but need a fairly robust and flexible approach, so you've chosen to use decision trees to get started.

In doing so, like the great data scientists of the past, you remembered the excellent education provided to you at UVA in an undergrad data science course and have outlined 20ish steps that will need to be undertaken to complete this task. As always, you will need to make sure to #comment your work heavily.

Footnotes:
- You can add or combine steps if needed.
- Also, remember to try several methods during evaluation and always be mindful of how the model will be used in practice.
- Make sure all your variables are the correct type (factor, character, numeric, etc.).
- Step 19: What are the top 3 movies based on the tune set? Which variables are most important in predicting the top 3 movies?
- Use the three highest predicted probabilities of falling above the threshold.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score, recall_score, precision_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder

1. Load the Data

In [3]:
# Load the data
movie_data_raw = pd.read_csv("../data/movie_metadata.csv")

2. Ensure all the variables are classified correctly including the target variable and collapse factor variables as needed.

### Target Variable
IMDb ratings greater than or equal to 8.0 (the code below uses `>=`).

In [23]:
# Data Prep 

movie_data = pd.read_csv("../data/movie_metadata.csv")

# Collapse movie directors variable
top10_director = movie_data.groupby(by='director_name').size().sort_values(ascending=False).head(10)
movie_data["director_name"] = movie_data.director_name.apply(lambda x: "Top 10" if x in top10_director else "Other").astype("category")
movie_data = movie_data.rename(columns={'director_name': 'director'})

# Collapse actor_1 variable
top30_actor = movie_data.groupby(by='actor_1_name').size().sort_values(ascending=False).head(30)
movie_data["actor_1_name"] = movie_data.actor_1_name.apply(lambda x: "Top 30" if x in top30_actor else "Other").astype("category")
movie_data = movie_data.rename(columns={'actor_1_name': 'lead_actor'})

# Adjust the variable for genres
#genre = ["Drama", "Comedy", "Action", "Horror", "Fantasy", "Documentary", "Crime", "Adventure", "Animation", "Biography"]
#movie_data['genres'] = movie_data['genres'].str.split("|")
#movie_data['genres'] = movie_data.genres.apply(lambda x: x[0] if x[0] in genre else "Other").astype("category")

# Collapse genre variable
movie_data['genres'] = movie_data['genres'].str.split("|").apply(lambda x: x[0])
top3_genre = movie_data.groupby(by='genres').size().sort_values(ascending=False).head(3)
movie_data["genres"] = movie_data.genres.apply(lambda x: "Top 3" if x in top3_genre else "Other").astype("category")

# Collapse language variable
movie_data['language'] = movie_data.language.apply(lambda x: "English" if x=="English" else "Other").astype("category")

# Collapse country variable
movie_data['country'] = movie_data.country.apply(lambda x: "USA" if x=="USA" else "Other").astype("category")

# Collapse rating variable
ratings = ["PG", "PG-13", "R"]
movie_data['content_rating'] = movie_data.content_rating.apply(lambda x: x if x in ratings else "Other").astype("category")

# Drop unnecessary features
drop_cols = ['color', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'facenumber_in_poster', 'cast_total_facebook_likes', 'movie_title', 'actor_3_name', 'plot_keywords', 'movie_imdb_link', 'actor_2_facebook_likes', 'aspect_ratio', 'movie_facebook_likes']
movie_data = movie_data.drop(columns=drop_cols)

# Encode categorical variables into numeric
category_cols = list(movie_data.select_dtypes('category'))
movie_data = pd.get_dummies(movie_data, columns=category_cols, drop_first=True)

# Classify target variable
imdb_score_target = 8.0
movie_data['imdb_score'] = movie_data.imdb_score.apply(lambda x: 1 if x >= imdb_score_target else 0).astype('category')


#movie_data.info()
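
The collapsing steps in this cell all follow one pattern: count the levels, keep the most frequent ones, and relabel the rest. A minimal sketch of that pattern on a made-up column follows. The `collapse_top_n` helper and the toy data are illustrative assumptions, not part of the lab, and unlike the cell above the sketch keeps the original level names instead of a single "Top N" label.

```python
import pandas as pd

def collapse_top_n(series: pd.Series, n: int, other_label: str = "Other") -> pd.Series:
    """Keep the n most frequent levels and relabel everything else (hypothetical helper)."""
    top_levels = series.value_counts().head(n).index
    # where() keeps values that satisfy the condition and replaces the rest
    return series.where(series.isin(top_levels), other_label).astype("category")

# Toy genre column (made-up values)
toy = pd.Series(["Drama", "Comedy", "Drama", "Horror", "Comedy", "Drama", "Western"])
collapsed = collapse_top_n(toy, n=2)
print(collapsed.tolist())
```

Using `isin` on an explicit index of top levels makes the membership test visible, whereas `x in some_series` silently checks the Series index rather than its values.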


3. Check for missing values and correct as needed. Once you've completed the cleaning, create a function that will do this for you in the future. In the submission, include only the function and the function call.

In [24]:
# Drop NA Rows
#movie_data.isna().sum()
movie_data = movie_data.dropna()

In [23]:
# Data Preparation Function (variant: 0/1 flags with label encoding)

# Load in movie data
movie_data_raw = pd.read_csv("../data/movie_metadata.csv")

def prepare_movie_data(movie_df):
    # Drop NA Rows
    movie_df = movie_df.dropna()

    # Collapse movie directors variable
    top10_director = movie_df.groupby(by='director_name').size().sort_values(ascending=False).head(10)
    movie_df["director_name"] = movie_df.director_name.apply(lambda x: 1 if x in top10_director else 0).astype("category")
    movie_df = movie_df.rename(columns={'director_name': 'top_director'})

    # Collapse actor_1 variable
    top30_actor = movie_df.groupby(by='actor_1_name').size().sort_values(ascending=False).head(30)
    movie_df["actor_1_name"] = movie_df.actor_1_name.apply(lambda x: 1 if x in top30_actor else 0).astype("category")
    movie_df = movie_df.rename(columns={'actor_1_name': 'top_lead_actor'})

    # Adjust the variable for genres
    genre = ["Drama", "Comedy", "Action", "Horror", "Fantasy", "Documentary", "Crime", "Adventure", "Animation", "Biography"]
    movie_df['genres'] = movie_df['genres'].str.split("|")
    movie_df['genres'] = movie_df.genres.apply(lambda x: x[0] if x[0] in genre else "Other").astype("category")

    # Collapse genre variable
    #movie_df['genres'] = movie_df['genres'].str.split("|").apply(lambda x: x[0])
    #top3_genre = movie_df.groupby(by='genres').size().sort_values(ascending=False).head(3)
    #movie_df["genres"] = movie_df.genres.apply(lambda x: "Top 3" if x in top3_genre else "Other").astype("category")

    # Collapse language variable
    movie_df['language'] = movie_df.language.apply(lambda x: 1 if x=="English" else 0).astype("category")
    movie_df = movie_df.rename(columns={'language': 'language_english'})

    # Collapse country variable
    movie_df['country'] = movie_df.country.apply(lambda x: 1 if x=="USA" else 0).astype("category")
    movie_df = movie_df.rename(columns={'country': 'country_usa'})

    # Collapse rating variable
    ratings = ["PG", "PG-13", "R"]
    movie_df['content_rating'] = movie_df.content_rating.apply(lambda x: x if x in ratings else "Other").astype("category")

    # Drop unnecessary features
    drop_cols = ['color', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'facenumber_in_poster', 'cast_total_facebook_likes', 'movie_title', 'actor_3_name', 'plot_keywords', 'movie_imdb_link', 'actor_2_facebook_likes', 'aspect_ratio', 'movie_facebook_likes']
    movie_df = movie_df.drop(columns=drop_cols)

    # Label Encoding
    le_columns = ['content_rating', 'genres']
    label_encoder = LabelEncoder()
    movie_df[le_columns[0]] = label_encoder.fit_transform(movie_df[le_columns[0]])
    movie_df[le_columns[1]] = label_encoder.fit_transform(movie_df[le_columns[1]])

    # Encode categorical variables into numeric
    #category_cols = list(movie_df.select_dtypes('category'))
    #movie_df = pd.get_dummies(movie_df, columns=category_cols, drop_first=True)

    # Classify target variable
    imdb_score_target = 8.0
    movie_df['imdb_score'] = movie_df.imdb_score.apply(lambda x: 1 if x >= imdb_score_target else 0).astype('category')

    return movie_df


movie_data = prepare_movie_data(movie_data_raw)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movie_df["director_name"] = movie_df.director_name.apply(lambda x: 1 if x in top10_director else 0).astype("category")
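
The warning above is most likely raised because pandas treats the frame returned by `dropna()` as a possible copy of the caller's data, so later column assignments look ambiguous. One common way to avoid it, sketched here on a toy frame rather than the movie data, is to take an explicit copy before assigning columns. The `prepare` function and the column names are illustrative assumptions.

```python
import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    # Explicit copy: later column assignments modify this new frame only
    df = df.dropna().copy()
    df["is_english"] = (df["language"] == "English").astype(int)
    return df

# Toy data with one missing language value (made-up)
raw = pd.DataFrame({"language": ["English", None, "French"], "score": [8.1, 7.0, 6.5]})
clean = prepare(raw)
print(clean)
print(list(raw.columns))  # the caller's frame keeps its original columns
```

The same one-line change (`movie_df = movie_df.dropna().copy()`) could be tried at the top of `prepare_movie_data`.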


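4. Guess what: you don't need to scale the data. Decision trees (DTs) don't require it because they make local, greedy decisions, splitting one feature at a time on a threshold, so rescaling a feature doesn't change which splits are chosen. It keeps getting easier; go to the next step.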
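
A quick way to convince yourself of this is to fit the same tree on raw and rescaled features and compare the predictions. The sketch below uses made-up integer features and an arbitrary rescaling (both are assumptions for illustration, not the movie data).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(200, 3)).astype(float)  # toy integer-valued features
y = (X[:, 0] + X[:, 1] > 100).astype(int)               # toy binary target

X_scaled = X * 1000.0 + 5.0  # arbitrary positive rescaling plus a shift

# Same hyperparameters and seed, different feature scales
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=45).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=45).fit(X_scaled, y)

# The split thresholds differ, but the row partitions (and predictions) match
same = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print(same)
```

Because a split only depends on the ordering of a feature's values, any order-preserving transformation leaves the partitions unchanged.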

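5. Determine the base rate or prevalence for the classifier. What does this number mean?

The base rate, or prevalence, is the proportion of observations that belong to the positive class. Here it is the proportion of movies with an IMDb score greater than or equal to 8.0. With a prevalence of about 5.6%, a model that always predicts the negative class would already be about 94.4% accurate, so accuracy alone says little about how well the classifier finds highly rated movies.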

In [26]:
# Calculate prevalence of classifier

prevalence = movie_data.imdb_score.value_counts()[1]/len(movie_data)
print(prevalence)

0.05619174434087883
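
To make the meaning of the base rate concrete, the sketch below scores a majority-class baseline on toy labels. The data are made up (6 positives out of 100, roughly mirroring the prevalence above), and `DummyClassifier` is used here only as an illustrative reference point, not as part of the lab's pipeline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy labels: 6 positives out of 100 (made-up data)
y = np.array([1] * 6 + [0] * 94)
X = np.zeros((len(y), 1))  # features are irrelevant for this baseline

# Always predict the most frequent class (the negative class here)
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

prevalence = y.mean()
baseline_accuracy = accuracy_score(y, pred)  # equals 1 - prevalence
baseline_recall = recall_score(y, pred)      # no positives are ever found
print(prevalence, baseline_accuracy, baseline_recall)
```

Any model worth keeping should beat this baseline on recall, precision, or F1, not just on accuracy.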


6. Split your data into test, tune, and train. (80/10/10)

In [4]:
# Separate features and target variable
x = movie_data.drop(columns='imdb_score')
y = movie_data['imdb_score']

# Keep 80% of the rows for training; the remaining 20% is held out
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, stratify=y, random_state=45)

# Split the held-out 20% evenly into tune and test sets (10% each)
x_tune, x_test, y_tune, y_test = train_test_split(x_test, y_test, test_size=0.5, stratify=y_test, random_state=45)
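
It can help to confirm that the two-stage split really yields 80/10/10 and that stratification keeps the positive rate similar in each part. The sketch below repeats the same two calls on made-up labels (1000 rows, 60 positives). The intermediate names `x_rest` and `y_rest` are assumptions introduced for clarity.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows with 60 positives (made-up, not the movie data)
x = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 60 + [0] * 940)

# Stage 1: 80% train, 20% held out
x_train, x_rest, y_train, y_rest = train_test_split(
    x, y, train_size=0.8, stratify=y, random_state=45)

# Stage 2: split the held-out rows evenly into tune and test
x_tune, x_test, y_tune, y_test = train_test_split(
    x_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=45)

# Report the size and positive rate of each part
for name, part in [("train", y_train), ("tune", y_tune), ("test", y_test)]:
    print(name, len(part), round(part.mean(), 3))
```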


7. Create the kfold object for cross validation.

In [5]:
# Instantiate KFold Object

kf = RepeatedStratifiedKFold(n_splits=5, n_repeats=4, random_state=45)
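
With 5 splits repeated 4 times, each candidate hyperparameter value is fitted and scored 20 times. The sketch below checks that count and shows that each validation fold preserves the class ratio. The toy labels are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy labels: 10 positives out of 100 (made-up data)
y = np.array([1] * 10 + [0] * 90)
X = np.zeros((len(y), 1))

kf = RepeatedStratifiedKFold(n_splits=5, n_repeats=4, random_state=45)
n_fits = kf.get_n_splits(X, y)  # n_splits * n_repeats

# Count the positives that land in each validation fold
positives_per_fold = [int(y[val_idx].sum()) for _, val_idx in kf.split(X, y)]
print(n_fits, set(positives_per_fold))
```

Stratification matters here because the positive class is rare; an unstratified fold could end up with almost no highly rated movies.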

8. Create the scoring metric you will use to evaluate your model and the max depth hyperparameter (grid search)

In [6]:
# Define scoring metrics to evaluate the model
scoring_metrics = {'accuracy': 'accuracy',
                   'recall': 'recall',
                   'precision': 'precision',
                   'f1': 'f1', 
                   'roc_auc': 'roc_auc'}

# Create the max depth hyperparameter
max_depth_param = {'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]}

9. Build the classifier object

In [7]:
# Instantiate Decision Tree Classifier object

dt = DecisionTreeClassifier(random_state=45)

10. Use the kfold object and the scoring metric to find the best hyperparameter value for max depth via the grid search method.

In [8]:
# Implement Grid Search on the Decision Tree Classifier
search = GridSearchCV(dt, max_depth_param, scoring=scoring_metrics, n_jobs=1, refit='f1', cv=kf)

11. Fit the model to the training data.

In [11]:
# Execute the search on the training data
dt_model = search.fit(x_train, y_train)

#dt_model.cv_results_

12. What is the best depth value?

In [10]:
# Retrieve the refit best estimator; its printed max_depth is the best depth value
best_depth = dt_model.best_estimator_

print(best_depth)

DecisionTreeClassifier(max_depth=7, random_state=45)


13. Print out the model

In [44]:
# Print model
print(dt_model)
print(dt_model.cv_results_)
dt_model

GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=4, n_splits=5, random_state=45),
             estimator=DecisionTreeClassifier(random_state=45), n_jobs=1,
             param_grid={'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]},
             refit='f1',
             scoring={'accuracy': 'accuracy', 'f1': 'f1',
                      'precision': 'precision', 'recall': 'recall',
                      'roc_auc': 'roc_auc'})
{'mean_fit_time': array([0.00224906, 0.00303868, 0.00389482, 0.0048978 , 0.00589439,
       0.0068891 , 0.0075983 , 0.00833124, 0.00923765, 0.01059617,
       0.01062615, 0.01047565]), 'std_fit_time': array([6.02556176e-04, 1.19521930e-04, 7.98080143e-05, 1.35054061e-04,
       1.03337754e-04, 1.22215931e-04, 1.02469865e-04, 1.09806158e-04,
       2.48323652e-04, 3.36301570e-03, 5.85654788e-04, 8.28670247e-04]), 'mean_score_time': array([0.00817362, 0.00360783, 0.00342817, 0.00354506, 0.00364507,
       0.00371563, 0.00343467, 0.00347562, 0.00350188, 0.00391423

14. View the results, comment on how the model performed using the metrics you selected.
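
The raw `cv_results_` dictionary is hard to read, so one option is to load it into a DataFrame and keep only the mean test score per metric and depth. The sketch below does this on synthetic data from `make_classification` with a smaller grid and a plain `StratifiedKFold`. All of these are stand-ins chosen for speed, and the numbers it prints say nothing about the movie model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the training data (assumption)
X, y = make_classification(n_samples=300, n_features=6, weights=[0.8, 0.2],
                           random_state=45)

metrics = ["accuracy", "recall", "precision", "f1", "roc_auc"]
search = GridSearchCV(
    DecisionTreeClassifier(random_state=45),
    {"max_depth": [1, 2, 3, 4, 5]},
    scoring={m: m for m in metrics},
    refit="f1",  # the depth with the best mean F1 is refit on all rows
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=45),
)
search.fit(X, y)

# One row per depth, one column per metric's mean validation score
results = pd.DataFrame(search.cv_results_)
summary = results[["param_max_depth"] + [f"mean_test_{m}" for m in metrics]]
print(summary.round(3))
print(search.best_params_)
```

Reading the table row by row shows the trade-off between metrics at each depth, which is what the commentary for this step should discuss.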



Output of `movie_data.info()` for the prepared data:

<class 'pandas.core.frame.DataFrame'>
Index: 3755 entries, 0 to 5042
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   top_director            3755 non-null   category
 1   num_critic_for_reviews  3755 non-null   float64 
 2   duration                3755 non-null   float64 
 3   gross                   3755 non-null   float64 
 4   genres                  3755 non-null   int64   
 5   top_lead_actor          3755 non-null   category
 6   num_voted_users         3755 non-null   int64   
 7   num_user_for_reviews    3755 non-null   float64 
 8   language_english        3755 non-null   category
 9   country_usa             3755 non-null   category
 10  content_rating          3755 non-null   int64   
 11  budget                  3755 non-null   float64 
 12  title_year              3755 non-null   float64 
 13  imdb_score              3755 non-null   category
dtypes: category(5), float64(6), i

In [None]:
# Data Preparation Function (variant: text labels with one-hot encoding instead of label encoding)

# Load in movie data
movie_data_raw = pd.read_csv("../data/movie_metadata.csv")

def prepare_movie_data(movie_df):
    # Drop NA Rows
    movie_df = movie_df.dropna()

    # Collapse movie directors variable
    top10_director = movie_df.groupby(by='director_name').size().sort_values(ascending=False).head(10)
    movie_df["director_name"] = movie_df.director_name.apply(lambda x: "Top 10" if x in top10_director else "Other").astype("category")
    movie_df = movie_df.rename(columns={'director_name': 'director'})

    # Collapse actor_1 variable
    top30_actor = movie_df.groupby(by='actor_1_name').size().sort_values(ascending=False).head(30)
    movie_df["actor_1_name"] = movie_df.actor_1_name.apply(lambda x: "Top 30" if x in top30_actor else "Other").astype("category")
    movie_df = movie_df.rename(columns={'actor_1_name': 'lead_actor'})

    # Adjust the variable for genres
    #genre = ["Drama", "Comedy", "Action", "Horror", "Fantasy", "Documentary", "Crime", "Adventure", "Animation", "Biography"]
    #movie_df['genres'] = movie_df['genres'].str.split("|")
    #movie_df['genres'] = movie_df.genres.apply(lambda x: x[0] if x[0] in genre else "Other").astype("category")

    # Collapse genre variable
    movie_df['genres'] = movie_df['genres'].str.split("|").apply(lambda x: x[0])
    top3_genre = movie_df.groupby(by='genres').size().sort_values(ascending=False).head(3)
    movie_df["genres"] = movie_df.genres.apply(lambda x: "Top 3" if x in top3_genre else "Other").astype("category")

    # Collapse language variable
    movie_df['language'] = movie_df.language.apply(lambda x: "English" if x=="English" else "Other").astype("category")

    # Collapse country variable
    movie_df['country'] = movie_df.country.apply(lambda x: "USA" if x=="USA" else "Other").astype("category")

    # Collapse rating variable
    ratings = ["PG", "PG-13", "R"]
    movie_df['content_rating'] = movie_df.content_rating.apply(lambda x: x if x in ratings else "Other").astype("category")

    # Drop unnecessary features
    drop_cols = ['color', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'facenumber_in_poster', 'cast_total_facebook_likes', 'movie_title', 'actor_3_name', 'plot_keywords', 'movie_imdb_link', 'actor_2_facebook_likes', 'aspect_ratio', 'movie_facebook_likes']
    movie_df = movie_df.drop(columns=drop_cols)

    # Encode categorical variables into numeric
    category_cols = list(movie_df.select_dtypes('category'))
    movie_df = pd.get_dummies(movie_df, columns=category_cols, drop_first=True)

    # Label Encoding
    #le_columns = ['content_rating', 'genres']
    #label_encoder = LabelEncoder()
    #movie_df[le_columns[0]] = label_encoder.fit_transform(movie_df[le_columns[0]])
    #movie_df[le_columns[1]] = label_encoder.fit_transform(movie_df[le_columns[1]])
    
    # Classify target variable
    imdb_score_target = 8.0
    movie_df['imdb_score'] = movie_df.imdb_score.apply(lambda x: 1 if x >= imdb_score_target else 0).astype('category')

    return movie_df