# Group 1 - Data Mining and Data Science project

## 1. Background

*Describe the organization and its operations, operational environment, analytical needs, etc.*

The organization chosen for this project is a fictional movie production company that creates and directs new movies. The company's typical working process is as follows. It chooses a work or an idea and acquires the full rights to it. It then puts the crew together, consisting of a writer, a director and actors. Once this is done, it reaches out to investors to raise the money needed to produce the film. Finally, it produces the film and sells it to theatres and resellers.

In recent years, the company has not followed any specific strategy for movie production. It has tried to guess what kind of movies people like and, on that basis, invested large amounts of capital in production. It has then put the crew together without any major analysis in advance. This approach has produced only a few successful movies, and the majority of the films have made a loss. The company has repeatedly tried to find a pattern among the movies that are profitable and sell well, but has not yet succeeded. It has now realized that its current strategy is not working well enough and that, if it continues in the same way, it could go bankrupt in the near future. The next movie may be crucial for the company's survival.

One factor that may increase a movie's chance of success is a high rating on IMDb. Many film companies use IMDb to find best-selling movies, so highly rated movies have a greater chance of generating profits (Meenakshi et al., 2018). The company's primary goal is therefore to achieve the best possible rating on IMDb.


## 2. Problem description

*Describe the problem that the organization is facing and the research question it needs answered.*

### 2.1 Problem description

The company has released several movies, but none of them has been particularly successful. For its next movie, the aim is to make a bestseller, or at least a more profitable film, to get the company back on track. This time the company wants to analyse which factors it should take into account before starting to create the movie. It wants to find out where its financial resources are best allocated in order to improve the film's chances of success. This in turn should increase the film's financial results and help the company reach its goal.

It is difficult to say which aspects of a movie are most crucial in defining it as good (Oliver and Hartman, 2010). In addition, as Sharda and Delen (2006) explain, it is very difficult to predict the demand for a movie in advance. This makes the movie business one of the riskiest endeavours for investors. Some even claim that it is not possible to predict how a movie is going to do in the marketplace (Sharda and Delen, 2006). As Jack Valenti, the former president of the Motion Picture Association of America, once said: “Excellence is a fragile substance and movie making is a collaboration of talent, which is why it is hard to make and buy great films.” (Valenti, 1978). His point is that talent is required to make a good movie, since movies are subject to the subjective judgment of viewers. However, with the help of new technologies and methods, it could be possible to “hack” the normal way of producing movies to increase the chances of success.

This project will investigate whether it is possible to analyse the quantitative data about movies available at IMDb, namely the length of the movie, the actors, the writer, the director and the genre, and then, based on these, see whether any factors increase the chances of producing a highly rated movie. One approach to this is data mining. According to Wu et al. (2014), Big Data can be used to explore large amounts of data and extract useful information. By applying data mining to movie ratings and other movie data from the well-known website IMDb, we hope to extract useful information that helps the company understand what it needs to improve for its next movie.

### 2.2 Research questions
	
Based on the above, this report will answer the following questions:

- Which are the most common factors for success among highly rated movies at IMDb?
- Which of these could be used, or should not be used, for the company’s next movie?

### 2.3 Hypothesis

To answer the research questions, this project will look at factors that are believed to play a major role in whether a movie gets a high rating from its viewers. The hypothesis, based on the data available through IMDb, is that the movie’s running time, actors, writer, director and genre are the factors that could have a major impact on the result.

**_Running time_**

It is not clear whether previous research has included a movie’s running time as a success factor. However, it could be interesting to look at. It is plausible that the length has some impact on viewers: a movie that is too long may cause the audience to lose concentration, while one that is too short may not give them time to fully get into it.

**_Writers_**

Compared with actors and directors, writers often tend to be quite anonymous in relation to their films. As Batty (2015) describes, there is a lack of screen production research, which means that this is an area that may contain gaps. It is also important to bear in mind that the writer is the one who creates the basic material of the movie. In this analysis it should therefore not matter who the writer is, because the results of the analysis are what the writer will work from. Nevertheless, this is an interesting factor to examine in relation to a movie’s success rate, although in the end it should not significantly affect our result.

**_Actors_**

When others have analysed the factors behind a movie’s success, many have looked at the impact of the movie’s stars, that is, the actors (Lee et al., 2016). Lee et al. (2014) use a measure of “star buzz” in their analysis of successful movies. Since much other movie research uses this factor, and even though it has shown mixed results, it seems like a factor that should be considered. It is also easy to imagine that a movie with a popular actor with many fans gets a high rating simply because that actor appears in it.

**_Director_**

The director of the movie is likely to have an impact on the success rate. Parkeh and Biswas (2015) analyse the factors that have the most impact on movies in certain genres. Their results show that in all six genres, direction is one of the driving factors. This factor will therefore be used in this report as well. The difference in this project is that the aim is to find the names of the directors who often create successful movies.

**_Genre_**

Genre is used in this analysis since it is commonly used in other research of this kind (Lee et al., 2014). It could be interesting to see whether any specific genres are especially popular at present.

### 2.4 Data Mining Goals and Success Criteria

The goal of this data mining project is to create a model showing the factors that make a successful movie.

Success criteria:

- A pattern can be interpreted.
- The prediction is somewhat credible.



## 3. Data collection

*Document your data collection process and the properties of the data here. Implement, using Python code, to load and preprocess your selected dataset.*

### 3.1 Data Collection

The data used for the analysis is user rating data from IMDb. The data is free and can be obtained from IMDb (imdb.com, 2018), with documentation available at https://www.imdb.com/interfaces/. The datasets contain complete user rating data from IMDb and, in addition to the ratings, information about movie titles, actors, directors and writers. The information selected is considered relevant enough to give useful results for the movie company.

IMDb is used as the primary source for this project for the following reasons:

- IMDb provides a large movie dataset that is open to all users.
- It has information about movie ratings, genres, actors, directors and writers.
- Movies on IMDb are exposed to a large number of people. At present, IMDb has about 83 million registered users (imdb.com).

IMDb does not have a complete dataset: it lacks information about movie budget, revenue and other important factors. However, the information available is considered sufficient for the purpose of this project.

### 3.2 Datasets/Tables

For the analysis, five different datasets will be downloaded from IMDb: “Title Basics”, “Name Basics”, “Title Ratings”, “Title Principals” and “Title Crew”.

In [44]:
# Run this cell to import the modules and set up some stuff
import pandas as pd
import matplotlib.pyplot as plt  # import pyplot explicitly so rcParams is reachable
%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 6]

- “Title Basics” contains basic information related to the movie title. The columns that will be used from this table are “tconst”, “titleType”, “primaryTitle”, “startYear”, “runtimeMinutes” and “genres”.

In [45]:
title_basics = pd.read_csv(
    "https://datasets.imdbws.com/title.basics.tsv.gz", 
    encoding="utf-8", sep="\t", 
    dtype={'tconst': str, 'titleType': str, 'primaryTitle': str, 'originalTitle': str, 'isAdult': int, 'startYear': str, 'endYear': str, 'runtimeMinutes': str, 'genres': str}
)

- “Name Basics” contains information about all the people who worked on each movie, but only information about actors will be used. The columns that will be used are “nconst”, “primaryName” and “knownForTitles”.

In [46]:
name_basics = pd.read_csv(
    "https://datasets.imdbws.com/name.basics.tsv.gz", 
    encoding="utf-8", sep="\t", 
    dtype={'nconst': str, 'primaryName': str, 'birthYear': str, 'deathYear': str, 'primaryProfession': str, 'knownForTitles': str}
)

- “Title Ratings” contains the actual information about the ratings for each title and the columns that will be used for this table are “tconst”, “averageRating” and “numVotes”.

In [47]:
title_ratings = pd.read_csv(
    "https://datasets.imdbws.com/title.ratings.tsv.gz", 
    encoding="utf-8", sep="\t", 
    dtype={'tconst': str, 'averageRating': float, 'numVotes': int}
)

- “Title Principals” contains information about the principal people related to every movie. The columns that will be used are “tconst”, “ordering”, “nconst” and “category”.

In [48]:
title_principals = pd.read_csv(
    "https://datasets.imdbws.com/title.principals.tsv.gz", 
    encoding="utf-8", sep="\t", 
    dtype={'tconst': str, 'ordering': int, 'nconst': str, 'category': str, 'job': str, 'characters': str}
)

- “Title Crew” contains information about the writers and directors of a movie. The columns that will be used are “tconst”, “directors” and “writers”.

In [49]:
title_crew = pd.read_csv(
    "https://datasets.imdbws.com/title.crew.tsv.gz", 
    encoding="utf-8", sep="\t", 
    dtype={'tconst': str, 'directors': str, 'writers': str}
)
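The IMDb files mark missing values with the literal string `\N`, which is one reason columns such as `runtimeMinutes` are read as strings above. As a minimal sketch, assuming the same tab-separated layout, `pandas.read_csv` can be asked to treat `\N` as missing through `na_values`. The in-memory sample below contains three rows copied from the outputs in this report and stands in for the real download; the variable names are our own.

```python
import io
import pandas as pd

# Small stand-in for title.basics.tsv.gz (three rows copied from this report's outputs)
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\truntimeMinutes\tgenres\n"
    "tt0035423\tmovie\tKate & Leopold\t2001\t118\tComedy,Fantasy,Romance\n"
    "tt0097540\tmovie\tResponso\t2004\t81\t\\N\n"
    "tt0107706\tmovie\tStupid Lovers\t2000\t\\N\t\\N\n"
)

sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    na_values="\\N",  # treat the IMDb missing-value marker as NaN
    quoting=3,        # csv.QUOTE_NONE: do not interpret quote characters inside titles
)

# runtimeMinutes is now numeric (float64 because it contains a NaN)
missing_per_column = sample.isna().sum()
print(sample.dtypes)
print(missing_per_column)
```

With the missing values encoded as `NaN`, rows lacking a runtime or genre can later be counted or dropped with the usual pandas tools instead of string comparisons.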

The table below visualizes the datasets and shows which columns will be used and which will be filtered out. The cells marked in green will be used and the others will be left out. The constants (“tconst” and “nconst”) are used for joining the tables together.

<img src="Table of tables.png">

### 3.3 Preprocessing

#### 3.3.1 Initial filtering

We keep only the relevant columns in each table:

In [50]:
name_basics = name_basics[['nconst', 'primaryName', 'knownForTitles']]
title_basics = title_basics[['tconst', 'titleType', 'primaryTitle', 
                             'startYear', 'runtimeMinutes', 'genres']]
title_principals = title_principals[['tconst', 'ordering', 'nconst', 'category']]

We keep only movies released from 2000 through 2017:

In [51]:
# Keep feature films released 2000-2017. startYear is a four-digit string, so string
# comparison matches numeric order; the missing-value marker "\N" falls outside the range.
is_movie = title_basics['titleType'] == "movie"
in_period = title_basics['startYear'].between("2000", "2017")
title_basics_filtered = title_basics.loc[
    is_movie & in_period,
    ['tconst', 'primaryTitle', 'startYear', 'runtimeMinutes', 'genres']
]
title_basics_filtered

|  | tconst | primaryTitle | startYear | runtimeMinutes | genres |
|---|---|---|---|---|---|
| 34822 | tt0035423 | Kate & Leopold | 2001 | 118 | Comedy,Fantasy,Romance |
| 65547 | tt0066853 | Na Boca da Noite | 2016 | 68 | Drama |
| 86845 | tt0088751 | The Naked Monster | 2005 | 100 | Comedy,Horror,Sci-Fi |
| 92819 | tt0094859 | Chief Zabu | 2016 | 74 | Comedy |
| 93991 | tt0096056 | Crime and Punishment | 2002 | 126 | Drama |
| 95433 | tt0097540 | Responso | 2004 | 81 | \N |
| 98103 | tt0100275 | The Wandering Soap Opera | 2017 | 80 | Comedy,Drama,Fantasy |
| 100143 | tt0102362 | Istota | 2000 | 80 | Drama,Romance |
| 103575 | tt0105849 | Xavier | 2003 | 100 | \N |
| 105395 | tt0107706 | Stupid Lovers | 2000 | \N | \N |
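The year filter above compares strings, which works because every valid `startYear` is a four-digit string. A more explicit alternative, sketched here on a hypothetical four-row sample rather than the full table, is to convert the column to numbers first so that `\N` becomes `NaN` and drops out of the range check. The identifiers in the sample are made up for illustration.

```python
import pandas as pd

# Hypothetical sample with the same column names as title_basics
sample = pd.DataFrame({
    "tconst": ["ttA", "ttB", "ttC", "ttD"],  # made-up ids
    "titleType": ["movie", "movie", "short", "movie"],
    "startYear": ["1999", "2001", "2005", "\\N"],
})

# Non-numeric entries such as "\N" are coerced to NaN
start_year = pd.to_numeric(sample["startYear"], errors="coerce")

# between() is inclusive at both ends and evaluates to False for NaN
mask = (sample["titleType"] == "movie") & start_year.between(2000, 2017)
filtered = sample.loc[mask]
print(filtered)
```

Only the 2001 movie survives: the 1999 title is too old, the 2005 title is not a movie and the last row has no year.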


We keep only the ratings with at least 1000 votes:

In [52]:
title_ratings_filtered = title_ratings.loc[title_ratings['numVotes'] >= 1000]
title_ratings_filtered

|  | tconst | averageRating | numVotes |
|---|---|---|---|
| 0 | tt0000001 | 5.8 | 1440 |
| 2 | tt0000003 | 6.6 | 1041 |
| 4 | tt0000005 | 6.2 | 1736 |
| 7 | tt0000008 | 5.6 | 1539 |
| 9 | tt0000010 | 6.9 | 5128 |
| 11 | tt0000012 | 7.4 | 8602 |
| 12 | tt0000013 | 5.7 | 1319 |
| 13 | tt0000014 | 7.2 | 3741 |
| 24 | tt0000026 | 5.7 | 1137 |
| 27 | tt0000029 | 5.9 | 2451 |


In [53]:
movies_ratings = pd.merge(title_basics_filtered, title_ratings_filtered, on='tconst')
movies_ratings

|  | tconst | primaryTitle | startYear | runtimeMinutes | genres | averageRating | numVotes |
|---|---|---|---|---|---|---|---|
| 0 | tt0035423 | Kate & Leopold | 2001 | 118 | Comedy,Fantasy,Romance | 6.4 | 72480 |
| 1 | tt0118589 | Glitter | 2001 | 104 | Drama,Music,Romance | 2.2 | 20532 |
| 2 | tt0118652 | The Attic Expeditions | 2001 | 100 | Comedy,Horror,Mystery | 5.1 | 1531 |
| 3 | tt0118694 | In the Mood for Love | 2000 | 98 | Drama,Romance | 8.1 | 100863 |
| 4 | tt0118852 | Chinese Coffee | 2000 | 99 | Drama | 7.3 | 3423 |
| 5 | tt0118926 | The Dancer Upstairs | 2002 | 132 | Crime,Drama,Thriller | 7.0 | 6013 |
| 6 | tt0119004 | Don's Plum | 2001 | 89 | Comedy,Drama | 5.8 | 3669 |
| 7 | tt0119273 | Heavy Metal 2000 | 2000 | 88 | Action,Adventure,Animation | 5.4 | 7028 |
| 8 | tt0120202 | State and Main | 2000 | 105 | Comedy,Drama | 6.8 | 19303 |
| 9 | tt0120263 | Songs from the Second Floor | 2000 | 98 | Comedy,Drama | 7.7 | 15176 |
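`pd.merge` performs an inner join by default, so a title is kept only if it appears in both tables. The sketch below, using made-up rows, shows that behaviour and uses the `validate` argument to check that `tconst` is unique on both sides. The toy frames and their values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows: t1 has no rating and t4 has no basic information
basics = pd.DataFrame({
    "tconst": ["t1", "t2", "t3"],
    "primaryTitle": ["A", "B", "C"],
})
ratings = pd.DataFrame({
    "tconst": ["t2", "t3", "t4"],
    "averageRating": [6.4, 8.1, 5.0],
})

# how="inner" is the default; validate raises MergeError if a key is duplicated
joined = pd.merge(basics, ratings, on="tconst", how="inner", validate="one_to_one")
print(joined)
```

Comparing `len(joined)` with the sizes of the input tables is a quick way to see how many titles were dropped by the join.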


In [54]:
movies_ratings_crew = pd.merge(movies_ratings, title_crew, on='tconst')
movies_ratings_crew

|  | tconst | primaryTitle | startYear | runtimeMinutes | genres | averageRating | numVotes | directors | writers |
|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0035423 | Kate & Leopold | 2001 | 118 | Comedy,Fantasy,Romance | 6.4 | 72480 | nm0003506 | nm0737216,nm0003506 |
| 1 | tt0118589 | Glitter | 2001 | 104 | Drama,Music,Romance | 2.2 | 20532 | nm0193554 | nm0921985,nm0486824 |
| 2 | tt0118652 | The Attic Expeditions | 2001 | 100 | Comedy,Horror,Mystery | 5.1 | 1531 | nm0440948 | nm0551138 |
| 3 | tt0118694 | In the Mood for Love | 2000 | 98 | Drama,Romance | 8.1 | 100863 | nm0939182 | nm0939182 |
| 4 | tt0118852 | Chinese Coffee | 2000 | 99 | Drama | 7.3 | 3423 | nm0000199 | nm0507277 |
| 5 | tt0118926 | The Dancer Upstairs | 2002 | 132 | Crime,Drama,Thriller | 7.0 | 6013 | nm0000518 | nm0787649 |
| 6 | tt0119004 | Don's Plum | 2001 | 89 | Comedy,Drama | 5.8 | 3669 | nm0730222 | nm0039192,nm0065818,nm0730222,nm0836476,nm0923673 |
| 7 | tt0119273 | Heavy Metal 2000 | 2000 | 88 | Action,Adventure,Animation | 5.4 | 7028 | nm0170402,nm0501341 | nm0084253,nm0127560,nm0247653,nm0313274,nm0532... |
| 8 | tt0120202 | State and Main | 2000 | 105 | Comedy,Drama | 6.8 | 19303 | nm0000519 | nm0000519 |
| 9 | tt0120263 | Songs from the Second Floor | 2000 | 98 | Comedy,Drama | 7.7 | 15176 | nm0027815 | nm0027815 |


In [55]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from graphviz import Source
from sklearn import tree

In [56]:
import numpy as np

movies_ratings_crew['successfull'] = np.where(movies_ratings_crew['averageRating']>=7.0, '1', '0')
movies_ratings_crew

|  | tconst | primaryTitle | startYear | runtimeMinutes | genres | averageRating | numVotes | directors | writers | successfull |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0035423 | Kate & Leopold | 2001 | 118 | Comedy,Fantasy,Romance | 6.4 | 72480 | nm0003506 | nm0737216,nm0003506 | 0 |
| 1 | tt0118589 | Glitter | 2001 | 104 | Drama,Music,Romance | 2.2 | 20532 | nm0193554 | nm0921985,nm0486824 | 0 |
| 2 | tt0118652 | The Attic Expeditions | 2001 | 100 | Comedy,Horror,Mystery | 5.1 | 1531 | nm0440948 | nm0551138 | 0 |
| 3 | tt0118694 | In the Mood for Love | 2000 | 98 | Drama,Romance | 8.1 | 100863 | nm0939182 | nm0939182 | 1 |
| 4 | tt0118852 | Chinese Coffee | 2000 | 99 | Drama | 7.3 | 3423 | nm0000199 | nm0507277 | 1 |
| 5 | tt0118926 | The Dancer Upstairs | 2002 | 132 | Crime,Drama,Thriller | 7.0 | 6013 | nm0000518 | nm0787649 | 1 |
| 6 | tt0119004 | Don's Plum | 2001 | 89 | Comedy,Drama | 5.8 | 3669 | nm0730222 | nm0039192,nm0065818,nm0730222,nm0836476,nm0923673 | 0 |
| 7 | tt0119273 | Heavy Metal 2000 | 2000 | 88 | Action,Adventure,Animation | 5.4 | 7028 | nm0170402,nm0501341 | nm0084253,nm0127560,nm0247653,nm0313274,nm0532... | 0 |
| 8 | tt0120202 | State and Main | 2000 | 105 | Comedy,Drama | 6.8 | 19303 | nm0000519 | nm0000519 | 0 |
| 9 | tt0120263 | Songs from the Second Floor | 2000 | 98 | Comedy,Drama | 7.7 | 15176 | nm0027815 | nm0027815 | 1 |


Number of successful movies:

In [57]:
movies_ratings_crew.successfull.value_counts()

0    10730
1     4770
Name: successfull, dtype: int64
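The counts above give a useful reference point for the success criterion that the prediction should be credible: a model that always predicts the majority class is already right for a known share of the movies. The short calculation below reproduces that share from the two counts. Treating it as a baseline for the classifier is our suggestion rather than a step of the analysis itself.

```python
# Class counts taken from the value_counts() output above
n_unsuccessful = 10730
n_successful = 4770
total = n_unsuccessful + n_successful

# Share of movies labelled successful (averageRating >= 7.0)
share_successful = n_successful / total

# Accuracy of always predicting the larger class
majority_baseline = max(n_unsuccessful, n_successful) / total

print(f"Share of successful movies: {share_successful:.3f}")        # 0.308
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")  # 0.692
```

A decision tree would therefore need to score clearly above roughly 69 % accuracy on held-out data before its predictions say more than the class distribution alone.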

Get only the first director, writer and genre:

In [58]:
# Keep only the first listed director (vectorised instead of a row-by-row loop)
movies_ratings_crew['directors'] = movies_ratings_crew['directors'].str.split(',').str[0]

In [59]:
# Keep only the first listed writer (vectorised instead of a row-by-row loop)
movies_ratings_crew['writers'] = movies_ratings_crew['writers'].str.split(',').str[0]

In [61]:
# Keep only the first listed genre (vectorised instead of a row-by-row loop)
movies_ratings_crew['genres'] = movies_ratings_crew['genres'].str.split(',').str[0]

In [62]:
movies_ratings_crew

|  | tconst | primaryTitle | startYear | runtimeMinutes | genres | averageRating | numVotes | directors | writers | successfull |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0035423 | Kate & Leopold | 2001 | 118 | Comedy | 6.4 | 72480 | nm0003506 | nm0737216 | 0 |
| 1 | tt0118589 | Glitter | 2001 | 104 | Drama | 2.2 | 20532 | nm0193554 | nm0921985 | 0 |
| 2 | tt0118652 | The Attic Expeditions | 2001 | 100 | Comedy | 5.1 | 1531 | nm0440948 | nm0551138 | 0 |
| 3 | tt0118694 | In the Mood for Love | 2000 | 98 | Drama | 8.1 | 100863 | nm0939182 | nm0939182 | 1 |
| 4 | tt0118852 | Chinese Coffee | 2000 | 99 | Drama | 7.3 | 3423 | nm0000199 | nm0507277 | 1 |
| 5 | tt0118926 | The Dancer Upstairs | 2002 | 132 | Crime | 7.0 | 6013 | nm0000518 | nm0787649 | 1 |
| 6 | tt0119004 | Don's Plum | 2001 | 89 | Comedy | 5.8 | 3669 | nm0730222 | nm0039192 | 0 |
| 7 | tt0119273 | Heavy Metal 2000 | 2000 | 88 | Action | 5.4 | 7028 | nm0170402 | nm0084253 | 0 |
| 8 | tt0120202 | State and Main | 2000 | 105 | Comedy | 6.8 | 19303 | nm0000519 | nm0000519 | 0 |
| 9 | tt0120263 | Songs from the Second Floor | 2000 | 98 | Comedy | 7.7 | 15176 | nm0027815 | nm0027815 | 1 |


## 4. Data analysis

*Document your choice and motivation for selected data mining method(s) here. Choose a data mining method(s) to use in Python code to perform an analysis of your chosen dataset. Describe why you chose the method(s) and what interesting things you have found from the analysis.*

The method chosen for this analysis is classification, and the technique used is the decision tree classifier. A decision tree is a methodology for reaching a final conclusion by taking a complex decision and dividing it into several easier decisions (Safavian and Landgrebe, 1991). By asking a question about an attribute, and then follow-up questions, in a hierarchical way, a decision or conclusion is finally reached (Tan, Steinbach and Kumar, 2014). A tree has three types of nodes: a root node, internal nodes and leaves (Tan, Steinbach and Kumar, 2014). As questions are asked and answered, a path is followed through the nodes until a leaf is reached, which is where the final decision or conclusion is made (Tan, Steinbach and Kumar, 2014).

 Hunts algorithm?

This technique makes it possible to predict whether a movie will get a high rating by asking questions and following the nodes to see whether the movie fulfils all the criteria for becoming a highly rated movie.
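As a minimal sketch of how such a classifier could be trained with scikit-learn, the example below fits a `DecisionTreeClassifier` on a small made-up table. The toy rows, the rule used to label them and the choice of `max_depth=3` are assumptions for illustration only. The real analysis would use the `movies_ratings_crew` table, where columns with many distinct values, such as directors and writers, need more careful encoding than the one-hot encoding shown here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up movies; the label rule (runtime >= 100) is purely hypothetical
toy = pd.DataFrame({
    "runtimeMinutes": [88, 95, 98, 99, 100, 104, 105, 118, 126, 132, 89, 140],
    "genres": ["Action", "Comedy", "Drama", "Drama", "Comedy", "Drama",
               "Comedy", "Comedy", "Drama", "Crime", "Comedy", "Drama"],
})
toy["successfull"] = (toy["runtimeMinutes"] >= 100).astype(int)

# Decision trees in scikit-learn need numeric input, so genres are one-hot encoded
X = pd.get_dummies(toy[["runtimeMinutes", "genres"]], columns=["genres"], dtype=int)
y = toy["successfull"]

# Hold out a quarter of the rows to estimate how well the tree generalises
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
rules = export_text(clf, feature_names=list(X.columns))

print(f"Training accuracy: {train_accuracy:.2f}")
print(f"Test accuracy: {test_accuracy:.2f}")
print(rules)
```

The text printed by `export_text` lists the questions asked at each node, which is the kind of interpretable pattern named in the success criteria.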

In [None]:
import collections
most_popular_directors = collections.Counter(movies_ratings_crew["directors"]).most_common()
most_popular_directors = pd.DataFrame(list(most_popular_directors))
most_popular_directors.columns = ["nconst", "No. of movies"]
most_popular_directors

In [None]:
most_popular_directors = pd.merge(most_popular_directors, name_basics, on='nconst')
most_popular_directors[['No. of movies', 'primaryName']]
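Counting movies per director shows who is most productive, but the hypothesis concerns directors who often make highly rated movies. One possible way to look at that, sketched on made-up rows with the same column names, is to group by director and compute the share of successful movies. The minimum of two movies is an arbitrary threshold chosen for the example.

```python
import pandas as pd

# Made-up rows using the column names of movies_ratings_crew
toy = pd.DataFrame({
    "directors": ["d1", "d1", "d1", "d2", "d2", "d3"],
    "averageRating": [7.5, 6.0, 8.1, 5.0, 6.9, 9.0],
})
toy["successfull"] = (toy["averageRating"] >= 7.0).astype(int)

# Number of movies and share of successful movies per director
per_director = (
    toy.groupby("directors")
       .agg(n_movies=("successfull", "size"), success_rate=("successfull", "mean"))
       .reset_index()
)

# Ignore directors with a single movie, then rank by success rate
frequent = (
    per_director[per_director["n_movies"] >= 2]
    .sort_values("success_rate", ascending=False)
)
print(frequent)
```

The resulting `directors` column could then be merged with `name_basics` on `nconst`, as in the cell above, to show names instead of identifiers.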

## 5. Evaluation of results

*Document an evaluation of your analysis results and describe how potentially actionable they are.*

In [None]:
# Add your own code

## 6. Schedule and description of project plan

*Rough schedule for the project beyond the pilot study presented in 3-5. This does not have to be advanced, you can simply provide an estimate based upon reported schedules for similar projects in the literature.*

The project plan for this project is based on the CRISP-DM model, showing the steps for best practice in data mining.

<img src="CRISP-DM.png">

*Figure 1: Phases of the CRISP-DM reference model*

**Business Understanding**

The first step of CRISP-DM is to get an understanding of the business, the organisation and the environment. In this case, this means understanding the movie production business and identifying which criteria define a successful movie. What counts as a successful data mining result should also be decided in this step.

We have an idea of what results we want and how to get them; what will take most of the time in this step is learning more about the business. Therefore, the estimated time for this will be XX (% / hours).


**Data Understanding**

In this step the data should be collected, described, verified and initially analysed. The data understanding is covered in section 3 of this report, where the datasets from IMDb are collected, described in terms of data types and visualised in a table.


The collection of the data will not take much time since all of the datasets are retrieved from the same webpage. What will be more time-consuming in this step is to describe it and to get an understanding of what information we could get out of the datasets. The estimated time for this will be XX hours (% of the project). 


**Data Preparation**

The preparation involves the selection of data. Why certain data in the datasets is selected is motivated at the beginning of the report. The datasets come from a relational database, so part of the preparation is to merge them so that the wanted data can be selected, since we do not want everything in all of the sets. All movies that are missing any important data will also be removed.

Even though the data is relatively clean from the beginning, we have to do quite a lot with it to make it useful for this project. We estimate that this will be quite time-consuming and will take about XX % / hours.

**Modeling**

Based on the data that was selected in the preparation step and the desired result, a data mining method should be chosen, and then the model should be built. In this project a decision tree will be created and based on our previous experience this will take about XX of the time.

**Evaluation** 

This step is about evaluating the results of the data mining, that is, how well they meet the success criteria that were decided earlier. We believe that this will not take long, since it should be relatively easy to see whether we get a good result or not. The estimated time for this will be XX.

**Deployment**

A plan for the deployment should be created, as well as a plan for maintenance. This plan will vary depending on the results of this project. The documentation will be in progress throughout the project and finished when the results from the data mining are reached.

This is estimated to take XX

## 7. Ethical aspects that need to be considered

*Are there ethical aspects that need to be considered? Are there legal implications (e.g., personal data / GDPR)? Are there implications if the case organization is a business, public authority, or nonprofit entity?*

Since this report analyses data about movies, we do not consider it necessary to address ethical aspects concerning personal information, because the analysis concerns publicly released movies and public data that we have collected from IMDb.

However, one ethical concern could be that this type of analysis might hamper creativity in movie making. If movies are created from an algorithm based on earlier movies, this could obstruct the production of genuinely new movies.

## References

Batty, C. (2015). *A screenwriter's journey into theme, and how creative writing research might help us to define screen production research*. Studies in Australasian Cinema, 9(2), pp. 110-121.

imdb.com (-), *Press Room*. Available: https://www.imdb.com/pressroom/about/ [2018-11-19]

imdb.com (2018), [online] Available at: https://datasets.imdbws.com/ [Accessed 22 Nov. 2018].

Meenakshi, K., Maragatham, G., Agarwal, N. and Ghosh, I. (2018). *A Data mining Technique for Analyzing and Predicting the success of Movie*. Journal of Physics: Conference Series, 1000, p.012100.

Oliver, M. B. & Hartman, T. (2010). *Exploring the Role of Meaningful Experiences in Users' Appreciation of “Good Movies”*. Berghahn Journals, Vol. 4, Issue 2, pp. 128–150.

Safavian, S. R. and Landgrebe, D. (1991). *A survey of decision tree classifier methodology*. IEEE Transactions on Systems, Man, and Cybernetics, 21(3), pp. 660-674.

Sharda, R. and Delen, D. (2006). *Predicting box-office success of motion pictures with neural networks*. Expert Systems with Applications 30. Stillwater, Oklahoma, pp.243–254.


Tan P-N., Steinbach, M., Kumar,V. (2014), *Introduction to Data Mining*, 1st ed, 7th ed., Harlow: Pearson

Valenti, J. (1978). *Motion Pictures and Their Impact on Society in the Year 2001*, speech given at the Midwest Research Institute, Kansas City, April 25, p. 3.

Wu, X., Zhu, X., Wu, G. and Ding, W. (2014). *Data Mining with Big Data*. Transactions on knowledge and data engineering. IEEE, pp. 97-107.


