# Motivation

## Intro

After browsing datasets available online, reviewing old projects and brainstorming, we decided to look into movies. We set out to build a movie recommendation engine using network analysis and text analysis. The initial idea was to find a dataset that provided a vast amount of information about movies. We also wanted to find movie scripts and reviews for each movie in the dataset and run text analysis, such as sentiment analysis and similarity measures, on these texts. The goal was to combine this information into a recommendation engine where a user could enter the name of a movie they liked and our algorithm would find and output movies that were similar to it.
    
Our goal was clear from the start, so the first step was to find a dataset that fulfilled our requirements. We found a good match on Kaggle.com. The dataset is not very large, around 50 MB, but it contains a lot of information: close to 20 properties for around 5000 movies.

In the following section we discuss each dataset that we used, how we gathered it and what its purpose was.


## Basic stats

### TMDB 5000 Movie dataset 

This was the central dataset in our analysis, and it fulfilled our requirements. It contains a vast amount of data, all gathered from [TMDb.com](https://www.themoviedb.org) (The Movie Database). The dataset can be downloaded from [Kaggle](https://www.kaggle.com/tmdb/tmdb-movie-metadata).

**Basic stats**

 - The dataset comes in two files:
     - **tmdb_5000_credits.csv**, 40 MB in size, with the following variables:
         - *movie_id* : the unique id of the movie
         - *title* : the name of the movie
         - *crew* : information about the crew members of the movie
         - *cast* : information about the cast members of the movie
     - **tmdb_5000_movies.csv**, 5.7 MB in size, with the following variables:
         - *movie_id* : the unique id of the movie
         - *original_language* : was not used
         - *title* : the name of the movie
         - *genres* : the genres of the movie; there are 17 movie genres in this dataset
         - *tagline* : was not used
         - *production_countries* : was not used
         - *production_companies* : the company or companies that produced the movie
         - *popularity* :
         - *spoken_languages* : was not used
         - *original_title* : was not used
         - *release_date* : the date when the movie was released
         - *runtime* : the total length of the movie
         - *vote_count* : how many users rated the movie
         - *vote_average* : the average rating that the movie received; we did not use this, as explained below
         - *status* : was not used
         - *revenue* : the total earnings of the movie
         - *overview* : a short description of the movie; we did not use this, as explained below
         - *budget* : the amount spent on making the movie
         - *keywords* : a set of words that describe the movie
         - *homepage* : was not used
 - The dataset contains information about 4800 movies.
 - The movies span the years 1916 to 2017.
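As a rough illustration of how the two files fit together, the sketch below joins a credits row to a movies row on `movie_id` and decodes list-like columns. It uses tiny in-memory samples instead of the real files, and it assumes that columns such as `genres` and `cast` are stored as JSON strings of objects with a `name` field. The helper names `read_rows`, `names_from_json` and `merge_movies` are hypothetical, and the sample rows are made up.

```python
import csv
import io
import json

# Tiny in-memory stand-ins for tmdb_5000_movies.csv and tmdb_5000_credits.csv.
# The column names follow the list above; the JSON encoding of `genres`
# and `cast` is an assumption about the files.
MOVIES_CSV = '''movie_id,title,genres
1,Example Movie,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""name"": ""Adventure""}]"
2,Another Movie,"[{""id"": 18, ""name"": ""Drama""}]"
'''

CREDITS_CSV = '''movie_id,title,cast
1,Example Movie,"[{""name"": ""Actor A""}, {""name"": ""Actor B""}]"
2,Another Movie,"[{""name"": ""Actor C""}]"
'''


def read_rows(text):
    """Parse CSV text into a dict keyed by movie_id."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["movie_id"]: row for row in reader}


def names_from_json(cell):
    """Decode a JSON list of objects and return their 'name' values."""
    return [item["name"] for item in json.loads(cell)]


def merge_movies(movies_text, credits_text):
    """Join the two tables on movie_id and decode the list columns."""
    movies = read_rows(movies_text)
    credits = read_rows(credits_text)
    merged = {}
    for movie_id, movie in movies.items():
        credit = credits.get(movie_id)
        if credit is None:
            continue  # no credits row for this movie
        merged[movie_id] = {
            "title": movie["title"],
            "genres": names_from_json(movie["genres"]),
            "cast": names_from_json(credit["cast"]),
        }
    return merged


merged = merge_movies(MOVIES_CSV, CREDITS_CSV)
print(merged["1"]["genres"])
```

With the real files, the two sample strings would be replaced by the file contents; the join key and the decoding step would stay the same if the assumptions hold.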

While this dataset has a lot to offer, we decided to gather a bit more data ourselves. TMDb already provided a user rating and a storyline (called overview in the TMDb dataset), but we also wanted the user ratings and storylines from IMDb (Internet Movie Database), since IMDb ratings are based on many more votes and are therefore more representative.


### User rating and storyline from IMDb

We used a library called ‘beautifulsoup’ together with regular expressions to scrape the [IMDb website](http://www.imdb.com/) and gather the user rating and the storyline for each movie in our Kaggle dataset. The code can be found in the [Get_additional_data notebook](https://github.com/gretarg09/Dtu-SocialGraphs-FinalProject/blob/master/Notebooks/Get_Additional_Data.ipynb). This produced two files:

[imdb-score.json](https://github.com/gretarg09/Dtu-SocialGraphs-FinalProject/blob/master/Data/imdb-score.json) : This file contains the IMDb rating of the movies in our Kaggle database


[imdb-storyline.json](https://github.com/gretarg09/Dtu-SocialGraphs-FinalProject/blob/master/Data/imdb-storyline.json) : This file contains the IMDb storyline of the movies in our Kaggle database

By scraping, and by manually looking up the IMDb storyline and rating for movies with non-ASCII characters in their names, we managed to gather information for all movies in our Kaggle dataset except five. In our analysis we handle this by simply ignoring those five movies whenever their IMDb data is needed.
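The "ignore what is missing" rule can be sketched as a small lookup that drops movies without IMDb data whenever that data is needed. The structure of `imdb-score.json` is not described here, so the sketch assumes a JSON object mapping movie titles to ratings. The helper `with_imdb_rating` is hypothetical and the sample values are made up.

```python
import json

# Assumed layout: a JSON object mapping movie title -> IMDb rating.
# Inline sample data; the real imdb-score.json may be structured differently.
SAMPLE_SCORES = json.loads('{"Example Movie": 7.9, "Another Movie": 6.4}')


def with_imdb_rating(titles, scores):
    """Return (title, rating) pairs, skipping titles with no IMDb rating."""
    pairs = []
    for title in titles:
        rating = scores.get(title)
        if rating is None:
            continue  # a movie for which no IMDb data was gathered
        pairs.append((title, rating))
    return pairs


titles = ["Example Movie", "Missing Movie", "Another Movie"]
rated = with_imdb_rating(titles, SAMPLE_SCORES)
print(rated)
```

Filtering at the point of use, rather than deleting the movies up front, keeps them available for the parts of the analysis that do not need IMDb data.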
    

### Manuscripts

Next we wanted to add text analysis to our recommendation engine. To do that, we decided to collect movie scripts, look for interesting connections between the movie data we had and the scripts, and use the scripts to find similarities between movies. We found a script online that downloads all the manuscripts on the website [The Internet Movie Script Database (IMSDb)](http://www.imsdb.com/). This code was taken from [this github repository](https://github.com/j2kun/imsdb_download_all_scripts) and adapted to our needs. Our code can be found in the [Get_additional_data notebook](https://github.com/gretarg09/Dtu-SocialGraphs-FinalProject/blob/master/Notebooks/Get_Additional_Data.ipynb).

IMSDb contains only 1122 manuscripts in total, and its intersection with our Kaggle dataset is 686 manuscripts. We therefore only managed to find manuscripts for 686 of the 4800 movies in our dataset.
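One way such an intersection could be computed is by comparing normalized titles, as in the sketch below. Matching on normalized titles is an assumption made for illustration and is not necessarily how the figure of 686 was obtained. The helper names and sample titles are hypothetical.

```python
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation/whitespace for matching."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", title.lower())
    return cleaned.strip()


def titles_with_scripts(dataset_titles, script_titles):
    """Return dataset titles that also appear among the script titles."""
    script_keys = {normalize_title(t) for t in script_titles}
    return [t for t in dataset_titles if normalize_title(t) in script_keys]


dataset_titles = ["Example Movie", "Another Movie", "Third Movie"]
script_titles = ["example-movie", "THIRD MOVIE", "Unrelated Script"]
matched = titles_with_scripts(dataset_titles, script_titles)
print(matched)
```

Title matching of this kind can miss movies whose titles are spelled differently on the two sites and can confuse remakes that share a title, so the result is best treated as approximate.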

The total 






Talk about this later

It was clear to us that we could not use the manuscripts in our recommendation engine, because we did not have a manuscript for every movie in the dataset. We still managed to get manuscripts for 686 movies, and we decided to analyse these thoroughly to find out whether they could be used as input to a recommendation engine.




***Grétar, how did you fetch the scripts???***

To add even more to the text analysis, we decided to look at user reviews. We found a dataset containing 100,000 user reviews from IMDb for some 14,000 movies. The dataset is available at: http://ai.stanford.edu/~amaas/data/sentiment/. It only contained each movie’s IMDb ID, which we did not have, so we again needed to do some scraping: this time we used the IMDb IDs to get the movie titles so that we could link the reviews to our TMDb dataset. That resulted in around 14,000 reviews for around 1,200 movies that were also among the 5,000 movies in the TMDb dataset.

The scripts and reviews were therefore only applicable to a subset of our main movie dataset. This was a problem, as we did not want a recommendation engine that covered only a little more than 1,000 movies. We did, however, analyse the scripts thoroughly as a proof of concept that they could have been used as input to our recommendation engine.

After that we decided to get more text data. We thought about what we could use other than scripts and reviews to find similarities between movies, and settled on the movie’s storyline. This was data we did not have, so once again we turned to scraping, with the help of a library called ‘beautifulsoup’. After this we finally had some nice, rich text to use in our engine.

The end goal was a movie recommendation engine where a user could type in a movie and we would recommend similar movies the user might like. We wanted it to work well enough that we would use it ourselves; we have all spent a lot of time trying to decide on a movie to watch. The recommendation engine will use shortest path calculations on a network that we design. Multiple things will affect the links between movies, e.g. the cast, keywords, genres, directors, IMDb ratings and similarities in storylines.
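A minimal sketch of the shortest-path idea, under several assumptions that are not part of the description above: movies are linked when they share at least one attribute, an edge's weight is the reciprocal of the number of shared attributes (so more overlap means a shorter distance), and the recommendations are the movies closest to the one the user entered. It uses plain Python with `heapq` rather than a graph library, and the function names and sample data are hypothetical.

```python
import heapq
from itertools import combinations

# Made-up sample: each movie maps to a set of attributes
# (genres, keywords, cast members, directors, ...).
MOVIES = {
    "A": {"action", "space", "actor_x"},
    "B": {"action", "space", "actor_y"},
    "C": {"drama", "actor_y"},
    "D": {"comedy"},
}


def build_graph(movies):
    """Link movies that share attributes; weight = 1 / number shared."""
    graph = {title: {} for title in movies}
    for a, b in combinations(movies, 2):
        shared = len(movies[a] & movies[b])
        if shared:
            graph[a][b] = graph[b][a] = 1.0 / shared
    return graph


def shortest_distances(graph, source):
    """Dijkstra's algorithm: distance from source to each reachable movie."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, weight in graph[node].items():
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return dist


def recommend(graph, liked, top_n=3):
    """Return up to top_n movies ordered by distance from the liked movie."""
    dist = shortest_distances(graph, liked)
    others = [(d, title) for title, d in dist.items() if title != liked]
    return [title for d, title in sorted(others)[:top_n]]


graph = build_graph(MOVIES)
print(recommend(graph, "A"))
```

In this toy network, movie "C" is reachable from "A" only through "B", which is the kind of indirect similarity a shortest-path approach can surface; a movie with no links at all, such as "D", is never recommended.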
