# Motivation

## Intro

After browsing various datasets on the web, looking at old projects and brainstorming, we decided to look into movies. We set out to build a movie recommendation engine using network analysis and text analysis. The initial idea was to find a dataset that provided a vast amount of information about movies. Furthermore, we wanted to find movie scripts and reviews for each movie in the dataset and run text analysis, such as sentiment analysis and similarity measures, on these texts. The goal was to combine this information into a recommendation engine where users could enter the name of a movie they liked and our algorithm would find and output movies that were similar to it.
    
Our goal was clear from the start, so the next step was to find a dataset that fulfilled our requirements. We found a good match on Kaggle.com. The dataset is not very big, around 50 MB, but it contains a lot of information: close to 20 properties for around 5000 movies.

In the following section we discuss each dataset that we found, how we managed to gather it and what its purpose was.


## Basic stats

### TMDB 5000 Movie dataset 

This was the central dataset in our analysis, and it fulfilled our requirements. It contains a vast amount of data, all gathered from [TMDb.com](https://www.themoviedb.org) (The Movie Database). The dataset can be downloaded from [Kaggle](https://www.kaggle.com/tmdb/tmdb-movie-metadata).

**Basic stats**

 - The dataset comes in two files:
     - **tmdb_5000_credits.csv**, 40 MB, with the following variables:
         - *movie_id* : the unique id of the movie
         - *title* : the name of the movie
         - *crew* : info about the crew members of the movie
         - *cast* : info about the cast members of the movie
     - **tmdb_5000_movies.csv**, 5.7 MB, with the following variables:
         - *movie_id* : the unique id of the movie
         - *original_language* : was not used
         - *title* : the name of the movie
         - *genres* : the genres of the movie; there are 17 movie genres in this dataset
         - *tagline* : was not used
         - *production_countries* : was not used
         - *production_companies* : the company/companies that produced the movie
         - *popularity* : the total popularity of the movie in the TMDb database. We are not sure how this is calculated, and it is hard to find information about it.
         - *spoken_languages* : was not used
         - *original_title* : was not used
         - *release_date* : the date when the movie was released
         - *runtime* : the total length of the movie
         - *vote_count* : how many users graded the movie
         - *vote_average* : the average grade that the movie got; we did not use this, as explained below
         - *status* : was not used
         - *revenue* : the total earnings of the movie
         - *overview* : a short description of the movie; we did not use this, as explained below
         - *budget* : the amount spent on making the movie
         - *keywords* : a set of words that describe the movie
         - *homepage* : was not used
 - The dataset contains information about 4800 movies.
 - The movies span the years 1916 to 2017.

While this dataset has a lot to offer, we decided to gather a bit more data ourselves. We had the user rating and the storyline (called overview in the TMDb database) from TMDb, but we also wanted the user ratings and storylines from IMDb (Internet Movie Database), since IMDb ratings are based on many more votes and are therefore likely to be more representative.

**Data cleaning and preprocessing**

A couple of movies in the database lacked some information. For each analysis we simply checked whether all the data we needed was present, and if not, we left the movie out of that analysis. The dataset is quite good and few movies lack data, so we did not need to do much preprocessing.
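
The per-analysis check described above could look roughly like the sketch below. It is only an illustration, not the code from our notebooks: the helper `has_required_fields` and the sample records are hypothetical, and only the column names come from the dataset.

```python
def has_required_fields(movie, required):
    """Return True if every required field is present and non-empty."""
    for field in required:
        value = movie.get(field)
        # Treat None, empty strings and empty lists as missing data.
        if value is None or value == "" or value == []:
            return False
    return True


# Hypothetical sample records using column names from the dataset.
movies = [
    {"title": "Movie A", "runtime": 120, "keywords": ["space", "war"]},
    {"title": "Movie B", "runtime": None, "keywords": ["romance"]},
    {"title": "Movie C", "runtime": 95, "keywords": []},
]

# Each analysis declares the fields it needs and skips incomplete movies.
runtime_analysis = [m for m in movies if has_required_fields(m, ["runtime"])]
keyword_analysis = [m for m in movies if has_required_fields(m, ["keywords"])]

print([m["title"] for m in runtime_analysis])
print([m["title"] for m in keyword_analysis])
```

Filtering per analysis rather than once for the whole dataset keeps as many movies as possible in each individual analysis.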

**NOTE:** The Kaggle dataset also contains some TV shows, but they are very few, so we did not consider them a problem for our analysis and did not remove them. It would also be quite hard to go through all 4800 items in the dataset and figure out which are movies and which are TV shows.

### User rating and storyline from IMDb

We used a library called ‘beautifulsoup’ together with regex to scrape the [IMDb website](http://www.imdb.com/) and gather the user rating and the storyline for each movie in our Kaggle dataset. The code can be found in the [Get_additional_data notebook](https://github.com/gretarg09/Dtu-SocialGraphs-FinalProject/blob/master/Notebooks/Get_Additional_Data.ipynb). This produced two files:

[imdb-score.json](https://www.dropbox.com/sh/uw5uw9n6gtfpii0/AAB5UGCmpe6XqpOqPdVUxPLda?dl=0) : This file contains the IMDb rating of the movies in our Kaggle dataset


[imdb-score.json](https://www.dropbox.com/sh/uw5uw9n6gtfpii0/AAB5UGCmpe6XqpOqPdVUxPLda?dl=0) :  This file contains the IMDb storyline of the movies in our Kaggle dataset

By scraping, and by manually looking up the IMDb storyline and rating for movies with non-ASCII characters in their names, we managed to gather information for all movies in our Kaggle dataset except five. In our analysis we simply ignore the movies without IMDb information whenever that data is needed.

**Data cleaning and preprocessing:**

**TODO**

### Manuscripts

We wanted to add some text analysis to our recommendation engine. To do that, we decided to find movie manuscripts, look for interesting connections between the movie data we had and the movies' scripts, and use the scripts to find similarities. The website [The Internet Movie Script Database (IMSDb)](http://www.imsdb.com/) had the most comprehensive database of manuscripts that we could find. Unfortunately, it does not offer a way to download the scripts in bulk, but we found a script online that downloads all the manuscripts available on the website. The code was taken from [this github repository](https://github.com/j2kun/imsdb_download_all_scripts) and adapted to our needs. Our code can be found in the [Get_additional_data notebook](https://github.com/gretarg09/Dtu-SocialGraphs-FinalProject/blob/master/Notebooks/Get_Additional_Data.ipynb). It uses the modules BeautifulSoup and urllib to scrape the IMSDb website and download the files.

IMSDb hosts only 1116 manuscripts in total, and the intersection with our Kaggle dataset is 715 manuscripts. We therefore only managed to find manuscripts for 715 of the 4800 movies in our dataset.
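
Computing such an intersection comes down to comparing titles from the two sources. The sketch below shows one way to do it and is not necessarily how our notebook does it: the helper `normalize_title`, the sample titles and the handling of a trailing article ("Matrix, The") are all assumptions made for illustration.

```python
import re


def normalize_title(title):
    """Lowercase a title, move a trailing article to the front and drop punctuation."""
    title = title.strip().lower()
    # Turn "matrix, the" into "the matrix" (assumed listing convention).
    match = re.match(r"^(.*), (the|a|an)$", title)
    if match:
        title = f"{match.group(2)} {match.group(1)}"
    # Replace everything except ASCII letters, digits and spaces with a space.
    title = re.sub(r"[^a-z0-9 ]+", " ", title)
    # Collapse repeated whitespace.
    return " ".join(title.split())


# Hypothetical sample titles; the real lists hold about 4800 and 1116 entries.
kaggle_titles = ["The Matrix", "Avatar", "Spider-Man 2"]
script_titles = ["Matrix, The", "Spider-Man 2", "Some Unlisted Film"]

kaggle_keys = {normalize_title(t) for t in kaggle_titles}
script_keys = {normalize_title(t) for t in script_titles}

# Movies that have both metadata and a manuscript.
common = kaggle_keys & script_keys
print(sorted(common))
```

Title matching like this is fragile, since two different movies can share a name, so the result should be spot-checked by hand.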

The total size of the manuscript dataset is 232.6 MB and it can be downloaded [here](https://www.dropbox.com/sh/uw5uw9n6gtfpii0/AAB5UGCmpe6XqpOqPdVUxPLda?dl=0).

**Data cleaning and preprocessing:**

**TODO**

### User reviews

To add even more to the text analysis portion, we decided to look at user reviews. We found a dataset containing 100,000 IMDb user reviews of some 14,000 movies. The dataset is available at: http://ai.stanford.edu/~amaas/data/sentiment/. It only contains each movie's IMDb ID, which we did not have, so we again needed to do some scraping. This time we used the IMDb IDs to get the titles of the movies so that we could link them to our TMDb dataset. Here we only used urllib and regex to get the job done.

That resulted in around 14,000 reviews for around 1,200 movies that also appear among the 4800 movies in our Kaggle dataset. Both the scripts and the reviews therefore only cover a subset of our Kaggle dataset.
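
As an illustration of the regex step, the sketch below pulls IMDb IDs out of review URLs and counts the reviews per movie. It assumes each review comes with a URL of the form shown in the sample lines; the IDs are placeholders, the helper `extract_imdb_id` is hypothetical, and the lookup of the title for each ID (which needs network access) is left out.

```python
import re
from collections import Counter

# IMDb title IDs look like "tt" followed by digits.
IMDB_ID_PATTERN = re.compile(r"/title/(tt\d+)")


def extract_imdb_id(url):
    """Return the IMDb title ID found in a URL, or None if absent."""
    match = IMDB_ID_PATTERN.search(url)
    return match.group(1) if match else None


# Hypothetical sample lines, one URL per review (placeholder IDs).
review_urls = [
    "http://www.imdb.com/title/tt0000001/usercomments",
    "http://www.imdb.com/title/tt0000001/usercomments",
    "http://www.imdb.com/title/tt0000002/usercomments",
    "not a url",
]

ids = [extract_imdb_id(u) for u in review_urls]
# Count reviews per movie, ignoring lines without an ID.
reviews_per_movie = Counter(i for i in ids if i is not None)
print(reviews_per_movie)
```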


The code that gathers the reviews can be found in the [Sentiment_reviews notebook](https://github.com/gretarg09/Dtu-SocialGraphs-FinalProject/blob/master/Notebooks/Sentiment_reviews.ipynb).

As mentioned above, this dataset can be downloaded from this [website](http://ai.stanford.edu/~amaas/data/sentiment/), and its size is 220 MB. It can also be found [here](https://www.dropbox.com/sh/uw5uw9n6gtfpii0/AAB5UGCmpe6XqpOqPdVUxPLda?dl=0).

**Data cleaning and preprocessing**

**TODO**


# Tools, Theory and Analysis

As mentioned above, our main idea was to create a movie recommendation engine using graph theory and text analysis. We have already described how we gathered our datasets, and now we focus on the analysis part of the assignment. The analysis is divided into four parts: basic statistics, text analysis, network analysis and, finally, a whole section dedicated to the final version of our recommendation engine. We did a lot of work in this assignment and therefore split the analysis into five different notebooks. Each notebook is well commented, and we tried to include some conclusions for each part.

## Basic statistics



## Text analysis

## Network analysis

## Recommendation engine


# Discussion


We are all huge movie fans, so we had a lot of fun doing this project. As you have probably noticed, we did a lot of analysis on the dataset. We are quite happy with our findings and really proud of the final version of the recommendation engine. It functions quite well, and we will definitely be using it ourselves during the Christmas break when looking for movies to watch. We encourage you to try it out on the website.

Although it turned out great, we tried quite a lot of different things before we were satisfied. For the network part we looked at three different networks. In the first, movies were connected if they shared a certain percentage of their keywords. The second did the same for genres. In the third, nodes were cast members, linked if they had starred together in a certain number of movies. After analyzing each of the networks we decided that none of them was good enough. This led us to create a fourth network that combines the properties of the previous three. Movies are nodes, and to create links between movies we look at multiple properties, e.g. cast members, keywords, genres, directors and similarities in storyline. Once the links have been created and we are in the process of recommending, we also look at IMDb ratings.
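
To make the keyword network concrete, here is a minimal sketch of linking movies by keyword overlap. It is an illustration under assumptions rather than our notebook code: the overlap is measured with Jaccard similarity, which is only one possible reading of "a certain percentage", and the helper names, threshold and sample data are hypothetical.

```python
from itertools import combinations


def keyword_overlap(a, b):
    """Jaccard similarity: shared keywords divided by all distinct keywords."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_edges(keywords_by_movie, threshold):
    """Link every pair of movies whose keyword overlap reaches the threshold."""
    edges = []
    for (m1, k1), (m2, k2) in combinations(sorted(keywords_by_movie.items()), 2):
        score = keyword_overlap(k1, k2)
        if score >= threshold:
            edges.append((m1, m2, score))
    return edges


# Hypothetical sample data.
keywords_by_movie = {
    "Movie A": ["space", "alien", "war"],
    "Movie B": ["space", "alien", "future"],
    "Movie C": ["romance", "paris"],
}

edges = build_edges(keywords_by_movie, threshold=0.5)
print(edges)
```

The same pattern works for genres by swapping the keyword lists for genre lists. Comparing every pair is quadratic in the number of movies, which is acceptable for a few thousand movies but would need an index for much larger datasets.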

The text analysis itself went well, but apart from the storylines of the movies it ended up not being used in the recommendation engine. The reason is that the data we had for text analysis, ~14,000 reviews and ~1,000 movie scripts, only applied to a subset of our 4800 movies, and we did not want to reduce the number of movies that we could recommend. We did a lot of text analysis and made a proof of concept showing that the data could have been used in the recommendation engine. This tells us that we could improve the engine further by getting manuscripts and reviews for all our movies. If manuscripts are not available, we could get the subtitle text for each movie, do the same analysis on it and use that information to improve our engine.

**Retrospect, what could have been done differently**

- We encountered some problems when scraping the IMDb website because we based our scraping on the name of the movie. In hindsight, we should have found a way to map the TMDb ids to the IMDb ids and used that mapping to find the corresponding IMDb page. The difficulties arose because the name of a movie cannot be treated as a unique id. For example, Avatar is a movie that most of us know, directed by the well-known director James Cameron, but Avatar (Avatar: The Last Airbender) is also an animated series with an impressive IMDb rating of 9.2. When we scraped IMDb we got the rating of the animated Avatar rather than the James Cameron movie we were looking for. This did not seem to happen often and we managed to fix most if not all of the errors, but it took a lot of manual work that could have been avoided by simply mapping the TMDb ids to the IMDb ids. The lesson learned is to always use unique ids when doing this kind of data gathering and analysis.

- As mentioned several times in this notebook, we failed to gather manuscripts for all movies. However, subtitle text for most movies can be found on the web, and in retrospect it might have been a better idea to download the subtitles for each movie and do the text analysis on that dataset. That way we could have gathered text for each movie and used the findings to improve our recommendation engine further.
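
The first point can be illustrated with a small sketch that contrasts a title lookup with an id lookup. Everything in it is hypothetical: the ids are placeholders rather than real TMDb or IMDb ids, and the `tmdb_to_imdb` mapping stands in for whatever source would provide the real mapping.

```python
# Hypothetical records; the IDs are placeholders, not real IMDb IDs.
imdb_pages = [
    {"imdb_id": "tt0000001", "title": "Avatar", "kind": "animated series"},
    {"imdb_id": "tt0000002", "title": "Avatar", "kind": "feature film"},
]

# Assumed mapping from TMDb ID to IMDb ID (placeholder values).
tmdb_to_imdb = {101: "tt0000002"}


def find_by_title(title):
    """Title lookup: ambiguous, returns every page with that title."""
    return [p for p in imdb_pages if p["title"] == title]


def find_by_tmdb_id(tmdb_id):
    """ID lookup: returns exactly one page, or None if the ID is unknown."""
    imdb_id = tmdb_to_imdb.get(tmdb_id)
    by_id = {p["imdb_id"]: p for p in imdb_pages}
    return by_id.get(imdb_id)


print(len(find_by_title("Avatar")))
print(find_by_tmdb_id(101)["kind"])
```

The title lookup returns two candidates and leaves the choice to chance or manual work, while the id lookup resolves directly to the intended movie.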

**Future work**

What we want to do next is to scale this assignment up. Our plans can be summarized in the following bullets.
 - Our recommendation engine is limited to the number of movies in the dataset. [The Movie Database](https://www.themoviedb.org) contains information about ~400,000 movies, and we would like to expand the capabilities of our recommendation engine by including more movies in the dataset.
 - We would like to include the sentiment analysis and/or the similarity measures of the manuscripts in our recommendation engine, so that movies with similar mean sentiment values and similar manuscripts are more likely to have smaller distances between them in our network. As discussed in the assignment, we have already made some proof-of-concept analyses, and the only thing standing in our way is the problem of finding manuscripts for each movie. If it is not possible to get manuscripts for the majority of the movies, we would like to include an analysis of the movies' subtitle texts in our recommendation engine instead.
 - We did not create any backend service for the project (website). It would be a good idea to create a backend service for the website and host it on an external server, e.g. Heroku or Amazon.


############################################################################################################



**Talk about this later**

It was clear to us that we were not able to use this in our recommendation engine because we did not have information about each movie in the dataset. We still managed to get manuscripts for 715 movies in our dataset and we decided to analyse these manuscripts thoroughly with the goal of finding out if we could use these 


***Grétar, how did you fetch the scripts???*** This was a problem, as we didn’t want to make a recommendation engine that only contained a bit more than 1,000 movies. We did, however, analyse the scripts thoroughly to make a proof of concept showing that they could have been used as input to our recommendation engine. After that we decided to get more text data. We thought about what we could use other than scripts and reviews to find similarities between movies, and ended up trying the movies' storylines. This was data we didn’t have, so once again we turned to scraping, with the help of a library called ‘beautifulsoup’. After this we finally had some nice juicy text to use in our engine.

The end goal was a movie recommendation engine where a user could type in a movie and we would recommend similar movies the user might like. We wanted it to function quite well and even to use it ourselves. We’ve all spent a lot of time in our lives trying to decide on a movie to watch. The recommendation engine will use shortest path calculations on a network that we design. There will be multiple things affecting the links between movies, e.g. the cast of the movies, keywords, genres, directors, IMDb ratings and similarities in storylines.
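
The shortest path idea can be sketched with Dijkstra's algorithm on a small weighted graph. This is an illustration only: the graph, the weights and the helper names are hypothetical, and how the movie properties are turned into link weights is left open here. The sketch simply assumes that a smaller weight means more similar movies.

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra's algorithm over a dict-of-dicts graph with non-negative weights."""
    distances = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > distances.get(node, float("inf")):
            continue  # Stale queue entry; a shorter path was already found.
        for neighbour, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < distances.get(neighbour, float("inf")):
                distances[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return distances


def recommend(graph, movie, count=2):
    """Return the movies closest to `movie`, nearest first."""
    distances = shortest_distances(graph, movie)
    others = [(d, m) for m, d in distances.items() if m != movie]
    return [m for d, m in sorted(others)[:count]]


# Hypothetical undirected network; smaller weight means more similar.
graph = {
    "Movie A": {"Movie B": 1.0, "Movie C": 4.0},
    "Movie B": {"Movie A": 1.0, "Movie C": 1.5, "Movie D": 5.0},
    "Movie C": {"Movie A": 4.0, "Movie B": 1.5},
    "Movie D": {"Movie B": 5.0},
}

print(recommend(graph, "Movie A"))
```

Note that "Movie C" is reached more cheaply through "Movie B" (1.0 + 1.5) than through its direct link (4.0), which is exactly the kind of indirect similarity a shortest path calculation can pick up.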
