# Motivation

## Intro

After browsing various datasets on the web, looking at old projects and brainstorming, we decided to look into movies. We set out to build a movie recommendation engine using network analysis and text analysis. The initial idea was to find a dataset that provided a large amount of information about movies. Furthermore, we wanted to find movie scripts and reviews for each movie in the dataset and do text analysis on them, more specifically sentiment analysis and similarity measures. The goal was to combine all this information into a recommendation engine where users can enter the name of a movie they liked and our algorithm finds and outputs movies that are similar to it.
    
Our goal was clear from the get-go: we needed a dataset that fulfilled these requirements. We found one on Kaggle.com that was a good match for us. The dataset is not very big, around 50 MB, but it contains a lot of information: close to 20 properties for around 5000 movies.

In the following section we discuss each dataset that we used, how we gathered it and what its purpose was.

**NOTE** All of our analyses are limited in the sense that we only have information about roughly 5000 movies, so the results cannot be taken as the whole truth. However, the analyses still give us a good idea of the general behavior of movies.

## Basic stats

### TMDb 5000 Movie dataset 

This was the central dataset in our analysis and it fulfilled our requirements. It contains a large amount of data, all of it gathered from the site [TMDb.com](https://www.themoviedb.org) (The Movie Database). The dataset can be found and downloaded from this [website](https://www.kaggle.com/tmdb/tmdb-movie-metadata).

**Basic stats**

 - The dataset comes in two files: 
     - **tmdb_5000_credits.csv**, the size is 40 MB and the variables in this file are the following:
         - *movie_id* : the unique id of the movie
         - *title* : the name of the movie
         - *crew* : info about the crew members of the movie 
         - *cast* : info about the cast members of the movie
     - **tmdb_5000_movies.csv**, the size is 5.7 MB and the variables in this file are the following:
         - *movie_id* : the unique id of the movie
         - *original_language*: we did not use this variable  
         - *title* : the name of the movie
         - *genres* : there are 17 movie genres in this dataset.
         - *tagline* : was not used 
         - *production_countries*: was not used
         - *production_companies* : the company/companies that produced the movie 
         - *popularity* : the total popularity of the movie in the TMDb database. We are not sure how this is actually calculated and it is hard to find information about it. 
         - *spoken_languages* : was not used 
         - *original_title* : was not used 
         - *release_date* :  the date when the movie was released 
         - *runtime* : the total length of the movie 
         - *vote_count* : how many users rated the movie
         - *vote_average* : the average rating that the movie got, we did not use this as we will explain down below 
         - *status* : was not used  
         - *revenue* : the total earnings of the movie.
         - *overview* : a small description about the movie, we did not use this as we will explain down below
         - *budget* : The amount spent on making the movie
         - *keywords* : Set of words that is explanatory of the movie
         - *homepage* :  was not used
 - The dataset contains information about 4800 movies
 - Number of cast members ~ 72000
 - Number of movie genres: 17 
 - The movies span over 100 years (1916-2017).

While this dataset has a lot to offer, we decided to gather a bit more data ourselves. We had user ratings and storylines (called overview in the dataset) from TMDb, but we also wanted the user ratings and storylines from IMDb (Internet Movie Database), since the IMDb ratings are based on many more votes and are therefore more reliable.

**Data cleaning and preprocessing**

A couple of movies in the dataset lacked some information. For each analysis we simply checked whether all of the data needed for that analysis was present and, if not, left the movie out of that analysis. The dataset is quite good and few movies lack data, so we did not need to do much preprocessing.
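
The per-analysis check described above can be sketched as follows. This is a minimal illustration, not the code from our notebooks: the records, the helper names `has_fields` and `select_complete`, and the choice of what counts as "missing" are assumptions made for the example.

```python
# Hypothetical records; the field names follow the column names listed above.
movies = [
    {"title": "Movie A", "budget": 1000000, "revenue": 5000000, "runtime": 120},
    {"title": "Movie B", "budget": None, "revenue": 2000000, "runtime": 95},
    {"title": "Movie C", "budget": 3000000, "revenue": None, "runtime": None},
]

def has_fields(movie, required):
    """Return True if every required field is present and non-empty."""
    return all(movie.get(field) not in (None, "", []) for field in required)

def select_complete(movies, required):
    """Keep only the movies that have all fields needed for one analysis."""
    return [m for m in movies if has_fields(m, required)]

# Each analysis asks only for the fields it actually needs.
budget_vs_revenue = select_complete(movies, ["budget", "revenue"])
runtime_only = select_complete(movies, ["runtime"])
print([m["title"] for m in budget_vs_revenue])
print([m["title"] for m in runtime_only])
```

Filtering per analysis rather than once for the whole dataset means a movie that lacks, say, a budget can still take part in analyses that do not need it.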

**NOTE:** The Kaggle dataset also contains some TV shows, but they are very few and we did not consider it to be a problem for our analysis and therefore we did not delete them from the dataset. 

### User ratings and storylines from IMDb 

We used a library called 'beautifulsoup' together with regular expressions to scrape the [IMDb website](http://www.imdb.com/) and gather the user rating and the storyline for each movie in our Kaggle dataset. The code that was used to accomplish this can be found in the [Get_additional_data notebook](https://github.com/gretarg09/Dtu-SocialGraphs-FinalProject/blob/master/Notebooks/Get_Additional_Data.ipynb). This produced two files:

[imdb-score.json](https://www.dropbox.com/sh/uw5uw9n6gtfpii0/AAB5UGCmpe6XqpOqPdVUxPLda?dl=0) : This file contains the IMDb ratings of the movies in our Kaggle dataset. The size of the file is 85 KB.


[imdb-storyline.json](https://www.dropbox.com/sh/uw5uw9n6gtfpii0/AAB5UGCmpe6XqpOqPdVUxPLda?dl=0) :  This file contains the IMDb storylines of the movies in our Kaggle dataset. The size of the file is 2.8 MB.

By scraping, and by manually gathering the IMDb storylines and ratings for those movies that have non unicode characters in their names, we managed to gather information for all but five of the movies in our Kaggle dataset. In our analysis we simply ignore the movies for which we have no IMDb information whenever that information is needed.

**Data cleaning and preprocessing**

We needed several iterations to get our scraping to work as we wanted, but once it did, the data itself was clean and simple and needed no further processing. The main problem initially was getting the regular expression right so that it did not miss titles containing punctuation (e.g. "National Lampoon's Loaded Weapon 1" and "The Godfather: Part II"). Another barrier was learning how to use 'beautifulsoup' correctly.
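
To illustrate the kind of regular expression problem mentioned above, here is a small sketch. The HTML strings are hypothetical stand-ins (the real IMDb markup may look different) and the two patterns are our own illustration, not the expressions used in the notebook.

```python
import re

# Hypothetical page-title strings; the real IMDb markup may differ.
html_titles = [
    "<title>The Godfather: Part II (1974) - IMDb</title>",
    "<title>National Lampoon's Loaded Weapon 1 (1993) - IMDb</title>",
]

# Too strict: only letters, digits and spaces, so punctuation breaks the match.
strict = re.compile(r"<title>([A-Za-z0-9 ]+) \((\d{4})\)")
# More permissive: any characters, matched lazily up to the " (year)" part.
permissive = re.compile(r"<title>(.+?) \((\d{4})\)")

def extract(pattern, html):
    """Return (title, year) if the pattern matches, otherwise None."""
    match = pattern.search(html)
    return (match.group(1), int(match.group(2))) if match else None

for html in html_titles:
    print(extract(strict, html), extract(permissive, html))
```

The strict pattern silently returns nothing for both titles because of the colon and the apostrophe, which is the kind of miss that is easy to overlook when most titles contain no punctuation.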

### Manuscripts

We wanted to add some text analysis to our recommendation engine. To do that we decided to find movie manuscripts, look for interesting connections between the movie data we had and the movies' scripts, and use the scripts to find similarities. The website [The Internet Movie Script Database (IMSDb)](http://www.imsdb.com/) had the most comprehensive database of manuscripts that we could find. Unfortunately it does not offer the possibility of downloading the scripts, but we found a Python script online that downloads all the manuscripts available on the website. The code was taken from [this github repository](https://github.com/j2kun/imsdb_download_all_scripts) and adapted to our needs. Our code can be found in the [Get_additional_data notebook](https://github.com/gretarg09/Dtu-SocialGraphs-FinalProject/blob/master/Notebooks/Get_Additional_Data.ipynb). The code uses the modules 'beautifulsoup' and 'urllib' to scrape the IMSDb website and download the files.

The IMSDb website contains only 1116 manuscripts in total, and 715 of them belong to movies in our Kaggle dataset. We therefore only have manuscripts for 715 of the 4800 movies in our dataset.

The total size of the manuscript dataset is 232.6 MB and it can be downloaded [here](https://www.dropbox.com/sh/uw5uw9n6gtfpii0/AAB5UGCmpe6XqpOqPdVUxPLda?dl=0).

**Data cleaning and preprocessing**

We did some minor preprocessing of each manuscript. Each manuscript contains scene headings that we did not consider important, so we excluded those lines from the sentiment analysis. Scene headings are usually written in all upper case letters, so we removed every sentence that contains only upper case letters. We also removed punctuation and stopwords, and used the nltk.regexp_tokenizer to tokenize the text when building the TF-IDF vectors.
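
The preprocessing steps described above could look roughly like the sketch below. It uses only the Python standard library instead of nltk, treats each line as the unit that is checked for upper case, and uses a tiny made-up stopword list, so all of these are assumptions for illustration rather than the exact code in the notebook.

```python
import re
import string

# Tiny illustrative stopword list; the notebooks may have used a fuller one.
STOPWORDS = {"the", "a", "an", "and", "is", "to", "of", "in", "it", "at"}

def is_scene_heading(line):
    """A line counts as a heading if it has letters and all of them are upper case."""
    letters = [ch for ch in line if ch.isalpha()]
    return bool(letters) and all(ch.isupper() for ch in letters)

def preprocess(script):
    """Drop all-caps lines, strip punctuation, lowercase, and remove stopwords."""
    kept = [ln for ln in script.splitlines() if not is_scene_heading(ln)]
    text = " ".join(kept).translate(str.maketrans("", "", string.punctuation))
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]

sample = """INT. KITCHEN - NIGHT
The detective looks at the letter.
JOHN
It is a trap, and we walked into it!"""
print(preprocess(sample))
```

Note that the all-caps rule also drops character names such as `JOHN`, which are typically written in upper case in screenplays; whether that is desirable depends on the analysis.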

### User Reviews

To add even more to our text analysis we decided to look at user reviews. We found a dataset that contains 100,000 user reviews from IMDb on some 14,000 movies. The dataset is available at: http://ai.stanford.edu/~amaas/data/sentiment/. It only contains the movies' IMDb IDs, which we didn't have, so we again needed to do some scraping: this time we used the IMDb IDs to get the titles of the movies so that we could link them to our TMDb dataset. Here we only used 'urllib' and regular expressions to get the job done.

This resulted in around 14,000 reviews for around 1,200 movies that were in common with the 4800 movies from the Kaggle dataset. Both the scripts and the reviews were thus only available for a subset of our Kaggle dataset.


The code where we gather the reviews can be found in the [Sentiment_reviews notebook](https://github.com/gretarg09/Dtu-SocialGraphs-FinalProject/blob/master/Notebooks/Sentiment_reviews.ipynb). 

This dataset can be downloaded from this [website](http://ai.stanford.edu/~amaas/data/sentiment/), as mentioned above, and its size is 220 MB. It can also be found [here](https://www.dropbox.com/sh/uw5uw9n6gtfpii0/AAB5UGCmpe6XqpOqPdVUxPLda?dl=0).

**Data cleaning and preprocessing**

Again we needed several iterations to get our scraping to work as we wanted, but once it did, the data itself was clean and simple and needed no further processing. The main problem initially was getting the regular expression right so that it did not miss titles containing punctuation (e.g. "National Lampoon's Loaded Weapon 1" and "The Godfather: Part II").

# Tools, Theory and Analysis

As mentioned above, our main idea was to create a movie recommendation engine using graph theory and text analysis. We have already described how we gathered our data, and we now focus on the analysis part of the assignment. The analysis can be divided into four parts: 'Basic statistics', 'Text analysis', 'Network analysis' and finally a whole section dedicated to the final version of our 'Recommendation engine'. We did a lot of work in this assignment and therefore divided the analysis into 5 different notebooks. Each notebook is well commented and we tried to include conclusions for each part.


## Basic statistics

### The idea

We decided that we wanted to get a better idea of what our data looked like and see what it had to offer. At the same time we wanted to allow interested readers to see the different aspects of the movie industry through visualization of our data. We also wanted to use this as an opportunity to try and figure out which parts of the data are useful in our recommendation engine.
 
### Tools

To do this we used many plot types from 'matplotlib.pyplot', for example barplots, histograms and scatterplots. We also used a heatmap from 'seaborn' to show the correlation between a number of different aspects of our data.

### Application

**Basic Statistics** - [Basic_statistics notebook](https://github.com/gretarg09/Dtu-SocialGraphs-FinalProject/blob/master/Notebooks/Basic%20Statistics.ipynb)

The Basic Statistics notebook is well commented and contains a lot of discussion and visualization of our data. We encourage the reader to go over the notebook.

Some things that we looked at in the notebook:
* IMDb rating distribution
* TMDb rating vs budget
* Movie's popularity vs its release month
* Correlation between a movie's revenue and its budget
* Correlation between a movie cast's gender ratio and a movie's revenue
* Correlation between 9 aspects of a movie using a heatmap
* Movie genres, number of movies in each, their average IMDb rating and their revenue
* Biggest production companies in our dataset
* Number of movies produced each year as well as the movie revenue per year
* Keywords analysis, both frequency histogram and word cloud representation for each genre
* Stats about the cast and crew of the movies
* Similarities between movies' storylines
* Word clouds for keywords in each genre

### Outcome

We got a nice visualization of our data and at the same time became familiar with the dataset. We also concluded that the most interesting aspects of the data for the recommendation engine are the keywords of a movie, the movie's genre(s), the cast and the director. Later on we added storyline similarities and the IMDb ratings of the movies to the recommendation engine.



## Text analysis

### The idea 

The idea was to get manuscripts and reviews for the majority of the movies in our Kaggle dataset. We wanted to measure the similarity between each pair of manuscripts and to calculate the mean sentiment value of each manuscript as well as the variance of its sentiment values. The similarity measures give us valuable information about how similar two movies' storylines are, and the sentiment values give us information about the vocabulary used in each movie: how happy the manuscript is on average, and how much contrast there is between the use of happy and unhappy words in each movie's manuscript.

### Tools

**Sentiment analysis:** We used the happiness rank from the LabMT wordlist to compute the sentiment value of each manuscript. The wordlist is available as supplementary material (Data Set S1) to the article *Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter*, which can be accessed at [this link](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026752#s2). The article explains how the sentiment value of a text is calculated and how the dataset was created.

We calculated the sentiment score in two different analyses: first in the manuscript analysis, where we analyzed the sentiment scores of the manuscripts, and second in the review analysis, where we analyzed the sentiment scores of the movies' reviews.
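
As a rough illustration of the mean sentiment value and its variance, the sketch below averages per-word happiness scores over the tokens of a text. The word scores are made up for the example (they are not taken from the LabMT wordlist), and the function name `sentiment_stats` is hypothetical.

```python
from statistics import mean, pvariance

# Made-up scores on a 1-9 scale, for illustration only (not real LabMT values).
happiness = {"love": 8.4, "laugh": 8.2, "friend": 7.7, "war": 1.8, "death": 1.5}

def sentiment_stats(tokens, scores):
    """Return (mean, variance) over the tokens that appear in the word list."""
    values = [scores[tok] for tok in tokens if tok in scores]
    if not values:
        return None  # no scored words, so no sentiment value can be computed
    return mean(values), pvariance(values)

tokens = "they love to laugh until the war brings death".split()
avg, var = sentiment_stats(tokens, happiness)
print(round(avg, 2), round(var, 2))
```

Because every occurrence of a word contributes its score, this mean is the same as a frequency-weighted average over the distinct words. A text that mixes very happy and very unhappy words ends up with a middling mean but a high variance, which is why we looked at both quantities.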

**TF-IDF:** TF-IDF (Term Frequency - Inverse Document Frequency) is a numerical statistic of how important a word is to a document in a collection or corpus. It is calculated by multiplying the frequency of a word in a text by the inverse document frequency, which gives lower scores to words that also occur frequently in other texts in the corpus. In other words, TF-IDF indicates how important or special a word is in a text, giving high scores to the words that differentiate it from other texts.

We used TF-IDF in two different analyses. First, in the **manuscripts analysis**, we calculated the TF-IDF vector for each movie genre and used it to make wordclouds in which the size of a word reflects its TF-IDF score. Second, we calculated the TF-IDF vector for each movie's **IMDb storyline**. We then used the resulting vectors to calculate the cosine similarity between the movies' storylines and used that as a variable in our recommendation engine, as described in the 'Recommendation Engine' section.

**NOTE** We used the TfidfVectorizer class in the Python module sklearn.feature_extraction.text to calculate the TF-IDF vectors.
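
To make the TF-IDF and cosine similarity idea concrete, here is a from-scratch sketch using only the standard library. It uses the plain textbook weighting tf * log(N / df), which is an assumption for illustration and will not reproduce the exact numbers from TfidfVectorizer (which applies its own smoothing and normalization by default). The three storylines are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (tf * log(N / df))."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Invented storylines, already tokenized.
storylines = [
    "a marine is sent to a distant moon".split(),
    "a soldier explores a distant planet".split(),
    "a chef opens a restaurant in paris".split(),
]
vecs = tfidf_vectors(storylines)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The word "a" occurs in every storyline, so its inverse document frequency is zero and it contributes nothing to the similarity. The first two storylines share the rarer word "distant" and therefore get a positive similarity, while the first and third share only "a" and get a similarity of zero.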

### Applications

**Manuscripts analysis** - [Manuscripts_analysis notebook](https://github.com/gretarg09/Dtu-SocialGraphs-FinalProject/blob/master/Notebooks/Manuscripts_analysis.ipynb)

As discussed above, we managed to gather 715 manuscripts for movies in our Kaggle dataset. Since we did not find manuscripts for every movie, we consider these analyses a proof of concept: we wanted to find out whether the manuscripts contain information that would help improve our recommendation engine. It is also fun and interesting to analyze these manuscripts, and in the [Manuscripts_analysis notebook](https://github.com/gretarg09/Dtu-SocialGraphs-FinalProject/blob/master/Notebooks/Manuscripts_analysis.ipynb) we tried to answer the following questions:

  - What does the happiness rank distribution of these manuscripts look like?
  - What are the 10 'unhappiest manuscripts' and what are the 10 'happiest  manuscripts'?
  - Is there a correlation between IMDb ranking and happiness rank?
  - Is there a correlation between movie revenue and happiness rank? 
  - Is there a difference in the happiness rank of different movie genres?
  - Is there a difference in the variance of the sentiment values between different movie genres?
  - What directors direct the 'happiest' movies on average and what directors direct the 'unhappiest' movies on average?
  - Can we use TF-IDF to figure out which words are important for each genre and then visualize that in a wordcloud?
  
  
  **Note:**  Happiness rank means the mean sentiment value of the manuscript

The Manuscripts Analysis notebook is well commented and there are a lot of discussions and conclusions in there that should be interesting to look at. We encourage the reader to go over the notebook. 

**Reviews analysis** - [Sentiment_reviews notebook](https://github.com/gretarg09/Dtu-SocialGraphs-FinalProject/blob/master/Notebooks/Sentiment_reviews.ipynb)

In this notebook we looked at sentiment analysis of user reviews from IMDb:
 - We have 13635 reviews for the 1304 movies that were common between our Kaggle dataset and the dataset that contained the reviews.
 - The original reviews dataset is available [here](http://ai.stanford.edu/~amaas/data/sentiment/). It contains 100,000 reviews from IMDb.
 - In this notebook we did the following:
     - We plotted the sentiment score distribution of the reviews for our movies.
     - We plotted the sentiment score of each movie against its IMDb rating and checked whether there was any correlation.
     - Lastly we analyzed the correlation between the sentiment score of a review and the reviewer's rating of the movie. In this part we used all reviews that have a rating (50,000 reviews).

**Note** Sentiment score is the same as happiness score, which is the same as mean sentiment value. It is important to keep this in mind when going over the analysis in the notebooks.

**Recommendation engine** - [recommendation_engine notebook](https://github.com/gretarg09/Dtu-SocialGraphs-FinalProject/blob/master/Notebooks/recommendation_engine.ipynb)

Here we did a similarity measure on the IMDb storyline using the TF-IDF and cosine similarity measurement. We will discuss this notebook further in the 'Recommendation Engine' section.

### Outcome

The work that we did can be found in the previously mentioned notebooks. We encourage you to read through them; each notebook contains a conclusion part where we summarize the findings. We only discuss part of the outcomes below because it would take too much space and time to go over them all.

As mentioned previously, our intention was to use the text analysis of the manuscripts and the reviews to improve our recommendation engine. Because we failed to find manuscripts and reviews for the majority of our movies, we decided not to include them in the recommendation engine. Instead we scraped the movies' storylines from IMDb and used similarity measures on them as a variable in our recommendation engine, which improved it.

That being said, we still wanted to do some analysis on the manuscripts and the reviews. We looked at this part as a proof of concept: we wanted to see whether there was interesting information to be found and whether that information could be used to improve our recommendation engine. If, in future work, we could get our hands on reviews and manuscripts for the majority of the movies, we could easily use the information as input to our recommendation engine. Our main focus was on the manuscripts, but we also did some interesting analysis on the reviews. We only discuss the findings of the manuscript analysis here, but we encourage you to take a look at the reviews analysis.

We found that both the mean sentiment value of a movie's manuscript and the variance of its sentiment values provide useful information about what kind of movie it is. For example, if the mean sentiment value is high then the movie is most likely a so-called 'feel good movie', and we could recommend movies that have similar mean sentiment values. Likewise, the variance of the sentiment values gives valuable information about the movie: if the variance is high, there are a lot of contrasts in the storyline, meaning for example that the movie is either a dark movie that has some happy moments in it (romance or comedy) or a dark comedy (e.g. 'Deadpool').

We also found out that the most 'happy' movie genres on average (based on the sentiment analysis) are 'Music', 'Romance' and 'Comedy' and the most 'unhappy' genres are 'Action' and unsurprisingly 'Horror'. 

We did a lot more than what we have discussed here, the information and discussion can be found in the notebook. The decisions taken about preprocessing and datacleaning can also be found before each appropriate part.


## Network analysis

### The idea

Here we will analyze three networks:
* __Movie network(cast):__ This is a network where nodes will be movies and there will be a link between nodes if 10% of the cast is the same between the two nodes(movies).
* __Movie network(keywords):__ This is a network where nodes will be movies and there will be a link between nodes if the movies share 30% of their keywords. Keywords describe the movie in simple words; for 'Avatar', for example, the keywords are: 'culture clash', 'future', 'space war', 'society', 'futuristic', 'romance', 'space', etc.
* __Cast network:__ This is a network where nodes will be cast members from our movies and there will be a link between nodes if the cast members have starred in three or more movies together.
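
The threshold rule used for the two movie networks above can be sketched as follows. The movies and cast lists are invented, and measuring the overlap relative to the smaller of the two sets is only one possible reading of "share 10% of the cast"; the notebook may define the percentage differently.

```python
from itertools import combinations

# Hypothetical movies and cast lists, for illustration only.
casts = {
    "Movie A": {"actor1", "actor2", "actor3", "actor4"},
    "Movie B": {"actor1", "actor5", "actor6", "actor7"},
    "Movie C": {"actor8", "actor9"},
}

def overlap_fraction(a, b):
    """Shared items as a fraction of the smaller set (one possible definition)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def build_edges(items, threshold):
    """Link two movies when their overlap fraction reaches the threshold."""
    edges = []
    for (m1, s1), (m2, s2) in combinations(items.items(), 2):
        if overlap_fraction(s1, s2) >= threshold:
            edges.append((m1, m2))
    return edges

print(build_edges(casts, 0.10))  # threshold used for the cast-based movie network
print(build_edges(casts, 0.30))  # a stricter threshold removes the link again
```

The same `build_edges` function works for keywords by passing a dictionary of keyword sets and the 30% threshold. Comparing every pair of movies is quadratic in the number of movies, which is still manageable for a few thousand nodes.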


### Tools

**Betweenness centrality:** In a traditional setting, this measure calculates all shortest paths in the network, and each node then gets a score according to the fraction of all shortest paths that pass through it. By calculating the betweenness centrality in our network we can find the biggest hubs.

**Degree distribution:** The probability that a randomly chosen node has a certain degree. In a given realization of a random network some nodes gain numerous links, while others acquire only a few or no links. By looking at the degree distribution we can get an idea of how the network is structured, for example whether it is scale-free or perhaps a random network. There is more on degree distribution in the [Network Science Book](http://barabasi.com/networksciencebook/chapter/3#degree-distribution).
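
A minimal sketch of how a degree distribution can be computed from an edge list, using only the standard library. The small network is hypothetical, and nodes without any link are ignored because they never appear in the edge list.

```python
from collections import Counter

# Small hypothetical undirected network as an edge list.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("E", "A")]

def degree_distribution(edges):
    """Return {k: p_k}, the fraction of nodes that have degree k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())
    n_nodes = len(degree)
    return {k: counts[k] / n_nodes for k in sorted(counts)}

print(degree_distribution(edges))
```

Plotting these fractions on a log-log scale is a common first check for a heavy-tailed (scale-free-like) distribution.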

**Clustering coefficient:** The degree of a node contains no information about the relationship between a node's neighbors. Do they all know each other, or are they perhaps isolated from each other? The answer is provided by the local clustering coefficient $C_i$, that measures the density of links in node i's immediate neighborhood: $C_i$ = 0 means that there are no links between i's neighbors; $C_i$ = 1 implies that each of i's neighbors link to each other. From [Network Science Book](http://barabasi.com/networksciencebook/chapter/3#clustering-3-9).
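
The local clustering coefficient can be computed directly from its definition, $C_i = 2L_i / (k_i(k_i - 1))$, where $k_i$ is the degree of node i and $L_i$ the number of links between its neighbors. The sketch below does this for a small hypothetical network; returning 0 for nodes with fewer than two neighbors is a convention chosen for the example.

```python
from itertools import combinations

# Hypothetical undirected network stored as adjacency sets.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

def local_clustering(graph, node):
    """Fraction of pairs of the node's neighbors that are themselves linked."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention: undefined case is reported as 0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return 2 * links / (k * (k - 1))

for node in graph:
    print(node, local_clustering(graph, node))
```

Node A has three neighbors of which only one pair (B and C) is linked, so its coefficient is 1/3, while B and C each sit in a closed triangle and get a coefficient of 1.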

**Modularity** is a measurement that allows us to quantify the goodness of a partition of a network into communities, where a partition is a division of a network into an arbitrary number of groups such that each node belongs to one and only one group. More specifically, modularity measures systematic deviations from a random configuration of a network. This helps us identify groups that are embedded in a network and find nodes that interact more frequently with each other than they would in a random network. Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules. 
We used the [Python Louvain-algorithm](http://perso.crans.org/aynaud/communities/) implementation to find communities in our networks. We then compared the communities of the network found by the algorithm to the genres of the movies. 
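
To show what the modularity score measures, the sketch below evaluates the modularity of a given partition with the formula $Q = \sum_c \left[ L_c / L - (k_c / 2L)^2 \right]$, where $L$ is the total number of links, $L_c$ the number of links inside community c and $k_c$ the total degree of its nodes. It only scores a partition; it does not search for one as the Louvain algorithm does. The network and partitions are hypothetical.

```python
def modularity(edges, community):
    """Modularity Q of a partition: sum over communities of L_c/L - (k_c/(2L))^2."""
    n_edges = len(edges)
    internal = {}    # L_c: number of links inside community c
    degree_sum = {}  # k_c: total degree of the nodes in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / n_edges - (k / (2 * n_edges)) ** 2
        for c, k in degree_sum.items()
    )

# Two triangles joined by a single link (hypothetical example).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
good = {1: "x", 2: "x", 3: "x", 4: "y", 5: "y", 6: "y"}
single = {n: "all" for n in range(1, 7)}
print(modularity(edges, good), modularity(edges, single))
```

Splitting the network into its two triangles gives a clearly positive score, while putting every node into one community gives zero, which matches the intuition that a good partition has many links inside groups and few between them.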

### Applications

**Networks** - [Networks notebook](https://github.com/gretarg09/Dtu-SocialGraphs-FinalProject/blob/master/Notebooks/Networks.ipynb)

In this notebook we created and analysed the previously mentioned networks. More specifically, for all networks we took a look at the following:
* What's the maximum and minimum degree?
* How is the degree distribution?
* Visualize the networks.
* Analyze Python-Louvain communities and modularity.
* Look at betweenness centrality.
* Look at clustering coefficients.

The end goal is to find a network that we would be satisfied with using as the network for our recommendation engine.

### Outcome

None of these networks suffices by itself for our recommendation engine. However, the analyses gave us valuable insight into our data and gave us the idea for the final network. We therefore build a network that is a mix of many different properties. That network is crafted in a notebook of its own as the code is quite long. We discuss it below in the section 'Recommendation Engine'.

## Recommendation Engine

### The idea

The main network is built upon our analysis of the simpler networks. The first three networks we tried did not yield the desired result, so we merged them to try to make a smart recommendation engine.

The idea is that a user inputs the name of a movie (which is a node in the network). From that movie we find the shortest distances to all other movies in our network and recommend those movies (nodes) which are closest. We decided to use a number of properties to build this network, in the pursuit of making as good a recommendation engine as possible.

For the network we will look at the following:
* What's the maximum and minimum degree?
* How is the degree distribution?
* Visualize the networks.
* Analyze Python-Louvain communities and modularity.
* Look at betweenness centrality.
* Look at clustering coefficients.

### Tools

Here we used the same tools as in the 'Networks' section with the addition of **Dijkstra shortest path** and **TF-IDF**.

**Dijkstra shortest path:** The algorithm creates a tree of shortest paths from the starting vertex, the source, to all other points in the graph.
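
The following sketch shows Dijkstra's algorithm with a priority queue and how its distances can be turned into recommendations. It is a standard textbook implementation, not the code from our notebook; the network, the weights and the helper name `recommend` are hypothetical, and we assume that a smaller edge weight means two movies are more similar.

```python
import heapq

def dijkstra(graph, source):
    """Return {node: shortest distance from source} for non-negative edge weights."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbor, weight in graph[node].items():
            new_d = d + weight
            if new_d < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_d
                heapq.heappush(queue, (new_d, neighbor))
    return dist

def recommend(graph, movie, n=3):
    """Return the n movies with the smallest distance to the given movie."""
    dist = dijkstra(graph, movie)
    others = [(d, m) for m, d in dist.items() if m != movie]
    return [m for d, m in sorted(others)[:n]]

# Hypothetical weighted network: smaller weight means more similar movies.
graph = {
    "Movie A": {"Movie B": 1.0, "Movie C": 4.0},
    "Movie B": {"Movie A": 1.0, "Movie C": 1.5, "Movie D": 5.0},
    "Movie C": {"Movie A": 4.0, "Movie B": 1.5, "Movie D": 1.0},
    "Movie D": {"Movie B": 5.0, "Movie C": 1.0},
}
print(recommend(graph, "Movie A", n=2))
```

In this example the direct link from Movie A to Movie C has weight 4.0, but the path through Movie B is shorter (2.5), so indirect connections can move a movie up the recommendation list.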

**TF-IDF:** TF-IDF (Term Frequency - Inverse Document Frequency) is a numerical statistic of how important a word is to a document in a collection or corpus. It is calculated by multiplying the frequency of a word in a text by the inverse document frequency, which gives lower scores to words that also occur frequently in other texts in the corpus. In other words, TF-IDF indicates how important or special a word is in a text, giving high scores to the words that differentiate it from other texts. We calculated the TF-IDF vector for each movie's **IMDb storyline**. We then used the resulting vectors to calculate the cosine similarity between the movies' storylines and used that as a variable in our recommendation engine.

### Applications

**Recommendation Engine** - [recommendation_engine notebook](https://github.com/gretarg09/Dtu-SocialGraphs-FinalProject/blob/master/Notebooks/recommendation_engine.ipynb)

We decided to implement a weighted network, since connected nodes can have different importance. For example, a movie such as 'Iron Man' can have multiple links in our network. However, the movies (nodes) it links to are not all equally similar to 'Iron Man', and thus we add a weight to each edge. In addition, a movie with few edges can be similar to movies that it is not directly connected to, as long as there is a short path in the network that connects them.

To create a link in the network, some similarity between the movies (nodes) must exist. We decided that what matters for making a link between two movies is the number of common keywords, genres, actors/actresses and directors. In addition to that we looked at IMDb ratings and how similar the storylines of the movies are.
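
One way to turn the shared properties described above into an edge weight is sketched below. The importance values, the inversion of the similarity into a distance and the example movies are all assumptions made for illustration; the notebook tuned its own values by trial and error, and the IMDb ratings and storyline similarity are left out here for brevity.

```python
# Hypothetical importance of each property; the notebook tuned its own values.
IMPORTANCE = {"keywords": 1.0, "genres": 2.0, "cast": 1.5, "director": 3.0}

def similarity(movie_a, movie_b):
    """Weighted count of the properties two movies have in common."""
    return sum(
        weight * len(movie_a[prop] & movie_b[prop])
        for prop, weight in IMPORTANCE.items()
    )

def edge_distance(movie_a, movie_b):
    """Turn similarity into a distance; None means the movies get no link."""
    score = similarity(movie_a, movie_b)
    return 1.0 / score if score > 0 else None

# Invented movies described by sets of properties.
hero = {"keywords": {"superhero", "armor"}, "genres": {"Action", "Sci-Fi"},
        "cast": {"actor1", "actor2"}, "director": {"director1"}}
sequel = {"keywords": {"superhero", "armor"}, "genres": {"Action", "Sci-Fi"},
          "cast": {"actor1", "actor2"}, "director": {"director1"}}
drama = {"keywords": {"family"}, "genres": {"Drama"},
         "cast": {"actor1"}, "director": {"director9"}}
unrelated = {"keywords": {"cooking"}, "genres": {"Documentary"},
             "cast": {"actor7"}, "director": {"director5"}}

print(edge_distance(hero, sequel), edge_distance(hero, drama), edge_distance(hero, unrelated))
```

Inverting the similarity means that very similar movies end up close together, which is what a shortest-path search needs; movies with nothing in common simply get no edge.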

The notebook contains a detailed description of how we created the network and the main idea behind it. We also did the same analysis on this network as we did for the networks in the 'Network' section. 

### Outcome 

The result actually exceeded our expectations. We were really happy with the end product and we'll definitely be using this engine in the x-mas break. This part mainly consisted of putting together bits and pieces from our previous work. We spent some time deciding on the importance of the different movie properties, mainly through trial and error. We'll reflect further on the outcome in the 'Discussion' section below.

# Discussion


We are all huge movie fans and therefore we had a lot of fun doing this project. As you have probably noticed, we did a lot of analysis on the data. We are quite happy with our findings and really proud of the final version of the recommendation engine. It works quite well, and we'll definitely be using it ourselves during the x-mas break when looking for movies to watch. We encourage you to try it out on the website.

Although it turned out great, we tried quite a lot of different things before we were satisfied. For the network part we looked at three different networks. We created a network where movies were connected if they shared a certain percentage of their keywords. We did the same for genres, and we created one network where nodes were cast members who were linked if they had starred together in a certain number of movies. After analyzing each of the networks we decided that none of them was good enough. This resulted in us creating a fourth network which combines the properties of the previous three. Movies were nodes, and to create links to other movies we looked at multiple properties, e.g. cast members, keywords, genres, directors and similarities in storyline. After the links had been created, we also looked at IMDb ratings when making recommendations.

The text analysis itself went well, but apart from the storylines of the movies it ended up not being used in the recommendation engine. The reason was that the data we had for text analysis, ~14,000 reviews and ~1,000 movie scripts, only applied to a subset of our 4800 movies, and we didn't want to reduce the number of movies that we could recommend. We did a lot of text analysis and made a proof of concept showing that the data could have been used in the recommendation engine. This tells us that we could have further improved the engine by getting manuscripts and reviews for all our movies. If not the manuscripts, we could perhaps have gotten the subtitle texts for each movie, done the same analysis on those texts and used the information to further improve our engine.

**Retrospect, what could have been done differently**

- We encountered some problems when scraping the IMDb website because we were basing our scraping on the name of the movie. Looking back, we should have found a way to map the TMDb IDs to the IMDb IDs and used that mapping to find the corresponding IMDb pages. The difficulties arose because the name of a movie can't be considered a unique ID. Let's illustrate this with an example: Avatar is a movie that most of us know, directed by the well-known director James Cameron, but Avatar (Avatar: The Last Airbender) is also an animated TV series that has an impressive IMDb rating of 9.2. When we scraped IMDb we actually got this TV series' rating for Avatar instead of the rating of the James Cameron movie that we were looking for. This did not seem to happen often and we managed to fix most if not all of the errors, but it took a lot of manual work that could have been avoided by simply mapping the TMDb IDs to the IMDb IDs. The lesson learned is to always use unique IDs when doing this kind of data gathering and analysis.

- As mentioned multiple times in this notebook, we failed to gather manuscripts for all movies. However, subtitles for each movie can be found on the web, and in retrospect it could have been a better idea to download the subtitles for each movie and do text analysis on that data. That way we could have gathered text for each movie and used the findings to improve our recommendation engine further.
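
The first retrospective point above can be illustrated with a tiny sketch. The IDs and records are hypothetical placeholders, not real TMDb or IMDb identifiers; the point is only that a lookup keyed by title silently loses one of two pages with the same name, while a lookup keyed by ID keeps both.

```python
# Hypothetical scraped pages; the IDs are placeholders, not real identifiers.
scraped = [
    {"imdb_id": "tt_movie", "title": "Avatar", "kind": "movie"},
    {"imdb_id": "tt_series", "title": "Avatar", "kind": "tv series"},
]

by_title = {}
by_id = {}
for page in scraped:
    by_title[page["title"]] = page   # a later page silently overwrites an earlier one
    by_id[page["imdb_id"]] = page    # each page keeps its own entry

# Hypothetical TMDb-to-IMDb mapping for one movie in our dataset.
tmdb_to_imdb = {"tmdb_1": "tt_movie"}

print(by_title["Avatar"]["kind"])             # the TV series wins by accident
print(by_id[tmdb_to_imdb["tmdb_1"]]["kind"])  # the ID mapping finds the movie
```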

**Future work**

What we want to do next is to scale this assignment up. Our plans can be summarized in the following bullets.
 - Our recommendation engine is limited to the movies that we have in the dataset. [The movie database](https://www.themoviedb.org) contains information about ~400000 movies and we would like to expand the capabilities of our recommendation engine by including more movies in the dataset. 
 - We would like to include the sentiment analysis and/or the similarity measures of the manuscripts in our recommendation engine, so that movies with similar mean sentiment values and similar manuscripts become more likely to have small distances between them in our network. As discussed in the assignment, we have already made a proof of concept, and the only thing standing in our way is the problem of finding manuscripts for each movie. If it isn't possible to get manuscripts for the majority of the movies, then we would like to include analysis of the movies' subtitles in our recommendation engine. 
 - We did not create any backend service for the project (website). It would be a good idea to create a backend service for the website and host it on some external server, e.g. Heroku or Amazon.


# References

https://en.wikipedia.org/wiki/Tf%E2%80%93idf - tf–idf wikipedia page

https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm - Dijkstra's algorithm wikipedia page

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026752#s2 - Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter

http://barabasi.com/networksciencebook/ - Network Science by Albert-László Barabási

http://www.imsdb.com/ - The Internet Movie Script Database (IMSDb)

http://www.imdb.com/ - The Internet Movie Database (IMDb)

https://www.themoviedb.org  - The movie database

http://perso.crans.org/aynaud/communities/ - Python Louvain algorithm

