### Jonathan Bunch

17 October 2021

Bellevue University

DSC550-T301

---

# Final Project Milestone Two

Based on the feedback from milestone one, I will look for patterns in the features within my original Disney movie
profits dataset.  I may revisit additional datasets later, but for now I will attempt to model movie profits based on
the other features present in this dataset: release date, genre, and MPAA rating.

In [2]:
import pandas as pd
import numpy as np

# Import my dataset.
disney_raw = pd.read_csv("disney_movies_total_gross.csv")

In [3]:
# Take a look at the raw dataset.
disney_raw.head()

Unnamed: 0,movie_title,release_date,genre,mpaa_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,1937-12-21,Musical,G,184925485,5228953251
1,Pinocchio,1940-02-09,Adventure,G,84300000,2188229052
2,Fantasia,1940-11-13,Musical,G,83320000,2187090808
3,Song of the South,1946-11-12,Adventure,G,65000000,1078510579
4,Cinderella,1950-02-15,Drama,G,85000000,920608730


Looking at the raw dataset, there are several features that I would like to remove or simplify.  For starters, the
movie_title feature is probably not useful for this analysis.  It seems strange to remove what I would consider one of
the most important features for human interpretability, but a title is essentially a unique label for each movie, so it
would probably only cause confusion when preparing these data for machine learning algorithms.

In [4]:
# Make a copy of the dataset to work with.
ddf1 = disney_raw.copy()
# Drop the movie_title feature.
ddf1 = ddf1.drop(columns='movie_title')
# Check the results.
ddf1.head()

Unnamed: 0,release_date,genre,mpaa_rating,total_gross,inflation_adjusted_gross
0,1937-12-21,Musical,G,184925485,5228953251
1,1940-02-09,Adventure,G,84300000,2188229052
2,1940-11-13,Musical,G,83320000,2187090808
3,1946-11-12,Adventure,G,65000000,1078510579
4,1950-02-15,Drama,G,85000000,920608730


The other feature I want to remove is "total_gross".  This feature represents the dollar amount of the gross profits,
without any kind of adjustment for inflation.  I believe using non-adjusted profits from such a large span of time could
create a false correlation between time and profit, as well as potentially introducing other complications.  Luckily,
the dataset already includes an adjusted profit feature, which should alleviate many of these issues.

In [5]:
# Drop the non-adjusted profits feature.
ddf1 = ddf1.drop(columns='total_gross')
# Check the results.
ddf1.head()

Unnamed: 0,release_date,genre,mpaa_rating,inflation_adjusted_gross
0,1937-12-21,Musical,G,5228953251
1,1940-02-09,Adventure,G,2188229052
2,1940-11-13,Musical,G,2187090808
3,1946-11-12,Adventure,G,1078510579
4,1950-02-15,Drama,G,920608730
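
As a rough illustration of the concern about non-adjusted profits, the sketch below builds a small synthetic example. None of these numbers come from the Disney dataset: the 3% yearly price growth and the alternating "real" gross values are assumptions chosen only to make the effect visible. The real gross has no trend over time, yet the nominal version correlates strongly with the year.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: NOT taken from the Disney dataset.
years = np.arange(1950, 2001)
# "Real" gross alternates between 90 and 110, so it has no trend over time.
real_gross = np.where(years % 2 == 0, 90.0, 110.0)
# Assumed price index growing 3% per year (illustrative only).
price_index = 1.03 ** (years - years[0])
# Nominal gross is the real gross expressed in each year's prices.
nominal_gross = real_gross * price_index

toy = pd.DataFrame(
    {"year": years, "real_gross": real_gross, "nominal_gross": nominal_gross}
)

# Pearson correlation of each gross measure with the release year.
corr_nominal = toy["year"].corr(toy["nominal_gross"])
corr_real = toy["year"].corr(toy["real_gross"])
print(f"year vs nominal gross: {corr_nominal:.2f}")
print(f"year vs real gross:    {corr_real:.2f}")
```

In this toy setting, any relationship between year and nominal gross comes entirely from the assumed price index, which is the kind of false correlation the adjusted feature is meant to avoid.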


Next, I want to work on the release_date feature.  The date is relevant to my analysis, but I only plan to use the
year portion.  I will extract the year and drop the month-day portion.

In [6]:
# I will start by converting the release_date feature to a datetime data type.
ddf1['release_date'] = pd.to_datetime(ddf1['release_date'])
# Now we can easily extract the year portion and assign it to a new column.
ddf1['year'] = ddf1['release_date'].dt.year
# Finally, we can drop the original date column.
ddf1 = ddf1.drop(columns='release_date')
# Check the results.
ddf1.head()

Unnamed: 0,genre,mpaa_rating,inflation_adjusted_gross,year
0,Musical,G,5228953251,1937
1,Adventure,G,2188229052,1940
2,Musical,G,2187090808,1940
3,Adventure,G,1078510579,1946
4,Drama,G,920608730,1950
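
The year extraction can also be sketched on a small stand-alone sample. This is a minimal sketch rather than part of the notebook's pipeline: the sample strings are hypothetical, and the malformed entry is included only to show how `errors="coerce"` in `pd.to_datetime` turns unparseable values into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical sample; the last entry is deliberately malformed.
dates = pd.Series(["1937-12-21", "1940-02-09", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(dates, errors="coerce")

# The .dt accessor extracts the year for the whole column at once.
years = parsed.dt.year

# Count how many dates could not be parsed.
n_bad = int(parsed.isna().sum())
print(years.tolist())
print(f"unparseable dates: {n_bad}")
```

Counting the `NaT` values after parsing is a quick way to confirm that every release date was understood before the original column is dropped.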


Next, I will address the missing values in some of my features. I also noticed some zero values in the adjusted
profits feature, which, in this context, is essentially the same as a missing value.

In [7]:
ddf1.isna().sum()

genre                       17
mpaa_rating                 56
inflation_adjusted_gross     0
year                         0
dtype: int64

In [8]:
(ddf1.inflation_adjusted_gross == 0).sum()

4
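
One way to make the zero values show up alongside the other missing values is to recode them as `NaN` first. The sketch below uses a hypothetical toy frame; only the column names mirror the notebook's dataset, and this is an alternative to filtering on zero, not the approach the notebook itself takes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names mirror the notebook's dataset.
toy = pd.DataFrame({
    "genre": ["Musical", None, "Drama", "Comedy"],
    "inflation_adjusted_gross": [5228953251, 0, 920608730, 0],
})

# Recode zero gross as NaN so it is counted like any other missing value.
toy["inflation_adjusted_gross"] = toy["inflation_adjusted_gross"].replace(0, np.nan)

# A single isna() call now reports the zero-gross rows as missing.
missing_counts = toy.isna().sum()
print(missing_counts)

# Rows without a usable target value can then be dropped by subset.
cleaned = toy.dropna(subset=["inflation_adjusted_gross"])
print(len(cleaned))
```

With this recoding, one `isna().sum()` summary covers both kinds of missing data.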

It looks like we are missing relatively few values in the profit and genre features, but more than a few in the
mpaa_rating feature.  The observations that are missing the profit data are not useful and should be dropped.

In [9]:
ddf1 = ddf1[ddf1.inflation_adjusted_gross != 0]

I think that any observations that are missing both the genre and rating will not contain enough information
to be useful, so I will drop those as well.

In [10]:
# Drop observations that are missing values for all of the specified subset of features.
ddf1 = ddf1.dropna(subset=['genre', 'mpaa_rating'], how='all')
# Check the missing values again.
ddf1.isna().sum()

genre                       10
mpaa_rating                 47
inflation_adjusted_gross     0
year                         0
dtype: int64

Unfortunately, we still have 57 observations that are missing either the genre or the rating value. In this context
it would probably not make much sense to attempt to calculate a fill value, since these values are presumably not
related to the values of any of the other features aside from the target. Therefore, I will drop these observations.

In [11]:
ddf1 = ddf1.dropna()
# Check the results.
ddf1.isna().sum()

genre                       0
mpaa_rating                 0
inflation_adjusted_gross    0
year                        0
dtype: int64

Now that we have only the features of interest, and NA values have been addressed, I will create dummy variables for
the genre and rating features.

In [14]:
ddf2 = pd.get_dummies(data=ddf1, columns=['genre', 'mpaa_rating'])
# Check the results.
ddf2.head()

Unnamed: 0,inflation_adjusted_gross,year,genre_Action,genre_Adventure,genre_Black Comedy,genre_Comedy,genre_Concert/Performance,genre_Documentary,genre_Drama,genre_Horror,genre_Musical,genre_Romantic Comedy,genre_Thriller/Suspense,genre_Western,mpaa_rating_G,mpaa_rating_Not Rated,mpaa_rating_PG,mpaa_rating_PG-13,mpaa_rating_R
0,5228953251,1937,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
1,2188229052,1940,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2,2187090808,1940,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
3,1078510579,1946,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,920608730,1950,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0


Now we are left with only numerical features: the adjusted gross profits, the release year, and a binary feature for
each genre and MPAA rating.
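
For anyone re-running this step, note that pandas 2.0 and later return boolean dummy columns by default, so `dtype=int` is needed to get 0/1 values like those shown above. The sketch below uses a hypothetical three-row frame. The `drop_first=True` variant is not used in this notebook; it is shown as an option that may be worth considering for linear models, where a full set of dummies is redundant with the intercept.

```python
import pandas as pd

# Hypothetical toy frame with the same categorical columns as the notebook.
toy = pd.DataFrame({
    "genre": ["Musical", "Adventure", "Drama"],
    "mpaa_rating": ["G", "G", "PG"],
    "year": [1937, 1940, 1950],
})

# dtype=int yields 0/1 columns regardless of the pandas version's default.
dummies = pd.get_dummies(toy, columns=["genre", "mpaa_rating"], dtype=int)
print(dummies.columns.tolist())

# Each movie has exactly one genre, so the genre dummies sum to 1 per row.
genre_cols = [c for c in dummies.columns if c.startswith("genre_")]
row_sums = dummies[genre_cols].sum(axis=1)

# Optional variant: drop the first level of each feature as a baseline.
reduced = pd.get_dummies(
    toy, columns=["genre", "mpaa_rating"], drop_first=True, dtype=int
)
print(reduced.columns.tolist())
```

Because the genre dummies always sum to one, any single genre column can be reconstructed from the others, which is the redundancy that `drop_first=True` removes.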