# Movie Recommendation Engine 

## What this does
Given a movie title, this engine recommends the movie with the most similar plot from a catalog of approximately 770 titles.


### Data Setup
Scrapes the titles of the top 770 movies in the Box Office Mojo worldwide box office ranking and writes them to `titles.txt`, one title per line.

In [26]:
# coding: UTF-8
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
with open('titles.txt', 'w') as file:
    # The ranking is split across 8 pages
    for num in range(1, 9):
        url = "http://www.boxofficemojo.com/alltime/world/"
        if num > 1:
            url = url + '?pagenum=' + str(num) + '&p=.htm'
        response = http.request('GET', url)
        soup = BeautifulSoup(response.data, "html.parser")
        rows = soup.select('tr')

        for index, row in enumerate(rows):
            # Skip the first three rows (presumably page and table headers)
            if index < 3:
                continue
            title = row.select_one('a')
            if title:
                # get_text() also works for links with nested tags, where .string is None
                file.write(title.get_text(strip=True))
                file.write('\n')

http://www.boxofficemojo.com/alltime/world/
http://www.boxofficemojo.com/alltime/world/?pagenum=2&p=.htm
http://www.boxofficemojo.com/alltime/world/?pagenum=3&p=.htm
http://www.boxofficemojo.com/alltime/world/?pagenum=4&p=.htm
http://www.boxofficemojo.com/alltime/world/?pagenum=5&p=.htm
http://www.boxofficemojo.com/alltime/world/?pagenum=6&p=.htm
http://www.boxofficemojo.com/alltime/world/?pagenum=7&p=.htm
http://www.boxofficemojo.com/alltime/world/?pagenum=8&p=.htm

### DB Setup
Creates the `movies` table, which stores the title and plot of each movie.

In [28]:
# create table
import sqlite3

dbname = 'recommend.db'
conn = sqlite3.connect(dbname)

conn.execute('drop table if exists movies')
conn.execute('create table movies(id, title, plot)')
conn.commit()
conn.close()
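
To illustrate how rows in this table can be written and read back, here is a minimal sketch. It assumes an in-memory database (so `recommend.db` is left untouched) and a made-up sample row; neither is part of the original notebook. It uses `?` placeholders, which let `sqlite3` handle quoting, so values containing apostrophes can be stored as they are.

```python
import sqlite3

# In-memory database, used here only so the sketch does not touch recommend.db
conn = sqlite3.connect(':memory:')
conn.execute('create table movies(id, title, plot)')

# Hypothetical sample row; the "?" placeholders bind the values safely
sample = (1, "Example Movie's Title", "A placeholder plot; it's only for illustration.")
conn.execute('insert into movies(id, title, plot) values (?, ?, ?)', sample)
conn.commit()

# Read the row back to confirm it was stored unchanged
rows = conn.execute('select id, title, plot from movies').fetchall()
print(rows)
conn.close()
```

Binding values this way avoids building the SQL statement by string formatting, which breaks as soon as a title or plot contains a quote character.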


### Get plot data using API
Using the movie titles retrieved in the Data Setup step, fetches the plot of each title from the OMDb API and stores it in the `movies` table.

In [30]:
# coding: UTF-8
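import urllib3
import json
import re
import sqlite3
from urllib.parse import quote_plus

# Read the OMDb API key; strip() drops a trailing newline in the key file
with open('apikey.txt', 'r') as file:
    apikey = file.read().strip()

http = urllib3.PoolManager()
base_url = "http://www.omdbapi.com/?apikey=" + apikey + "&plot=full&t="

with open('titles.txt', 'r') as file:
    titles = file.read().splitlines()

dbname = 'recommend.db'
conn = sqlite3.connect(dbname)
movie_id = 1

for title in titles:
    # Drop a parenthesized number such as a release year, then URL-encode the title
    title = re.sub(r'\(\d+\)', '', title).strip()
    url = base_url + quote_plus(title)
    response = http.request('GET', url)
    data = json.loads(response.data)
    if "Title" not in data:
        # Print the request URL of every title OMDb could not find
        print(url)
    else:
        # Apostrophes are removed so stored plots stay identical to the ones
        # fetched at recommendation time
        t = data['Title'].replace('\'', '')
        plot = data['Plot'].replace('\'', '')
        # Bind the values as parameters instead of formatting them into the SQL string
        conn.execute("insert into movies(id, title, plot) values (?, ?, ?)", (movie_id, t, plot))
        movie_id += 1

conn.commit()
conn.close()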

http://www.omdbapi.com/?i=tt3896198&apikey=1e73269b&plot=full&t=Knight+%26+Day
http://www.omdbapi.com/?i=tt3896198&apikey=1e73269b&plot=full&t=Marley+and+Me
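
The request URLs printed above can also be assembled with the standard library instead of manual string replacement. The following is a minimal sketch: `build_omdb_url` is a hypothetical helper name and `YOUR_API_KEY` is a placeholder, while the `apikey`, `plot` and `t` parameters are the ones this notebook already sends.

```python
from urllib.parse import urlencode

def build_omdb_url(apikey, title):
    # Hypothetical helper: urlencode percent-encodes each value and joins the pairs with "&"
    query = urlencode({'apikey': apikey, 'plot': 'full', 't': title})
    return 'http://www.omdbapi.com/?' + query

url = build_omdb_url('YOUR_API_KEY', 'Knight & Day')
print(url)
# → http://www.omdbapi.com/?apikey=YOUR_API_KEY&plot=full&t=Knight+%26+Day
```

Letting `urlencode` do the escaping covers every reserved character, not only spaces and `&`.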


### Recommendation
Recommends the movie whose plot is most similar to that of the movie the user enters.
Each plot is converted into a vector using tf-idf, and the similarity between two vectors is measured by cosine similarity.
The `recommend_engine` class computes the cosine similarity between the plot of the input movie and the plot of every stored movie, then returns the movie with the highest score.
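
To make the idea concrete, here is a minimal pure-Python sketch of tf-idf weighting and cosine similarity on three made-up plots. It is an illustration only, not the code used by the engine: the helper names and the sample plots are assumptions, and scikit-learn's `TfidfVectorizer` tokenizes and normalizes somewhat differently, so its exact scores will not match these.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphabetic words of two or more letters
    return re.findall(r"[a-z]{2,}", text.lower())

def tfidf_vectors(docs):
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Term count times a smoothed inverse document frequency
        vectors.append({t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    # Dot product over shared terms, divided by the product of the vector lengths
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical plots, invented for this example
plots = [
    "A soldier fights in the war to save his family",
    "A young wizard attends a school of magic",
    "A father protects his family during the war",
]
vecs = tfidf_vectors(plots)
war_score = cosine(vecs[2], vecs[0])    # shares "his", "family", "the", "war"
magic_score = cosine(vecs[2], vecs[1])  # shares no words
print(war_score > magic_score)
```

The third plot shares several words with the first and none with the second, so the first plot scores higher. Rare words receive a larger idf weight than words that appear in many plots, which is why tf-idf is preferred here over raw word counts.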

In [99]:
%precision %.2f

import urllib3
import json
import sqlite3
import re
import numpy as np
from urllib.parse import quote
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class recommend_engine:
    def __init__(self, dbname, apikey):
        self.conn = sqlite3.connect(dbname)
        # strip() guards against a trailing newline read from the key file
        self.apikey = apikey.strip()
        
    # Close the DB connection when the engine is garbage-collected
    def __del__(self):
        self.conn.close()

    # Percent-encode the title so spaces, '&' and other reserved characters are URL-safe
    def __format_title(self, title):
        return quote(title)

    # Delete numbers as they are valueless in vectorization
    def __format_plot(self, plot):
        return re.sub(r'\d', '', plot)
        
    def __fetch_plot(self, title):
        http = urllib3.PoolManager()
        base_url = "http://www.omdbapi.com/?apikey=" +  self.apikey + "&plot=full&t="

        url = base_url + self.__format_title(title)
        response = http.request('GET', url) 
        data = json.loads(response.data)
        
        if 'Plot' not in data:
            return None
        
        plot =  data['Plot'].replace('\'', '')
        return plot
        
    def __find_most_similar(self, vecs):
        best_index = -1
        best_score = 0.0

        # The last row is the user input; compare it against all stored plots at once
        if vecs.shape[0] > 1:
            scores = cosine_similarity(vecs[-1], vecs[:-1]).ravel()
            # Exclude the same movie as the user input (score of 1, within float tolerance)
            scores[np.isclose(scores, 1.0)] = 0.0
            if scores.max() > 0:
                best_index = int(scores.argmax())
                best_score = float(scores[best_index])

        if best_index == -1:
            raise Exception("We can't find any recommended movies. Please try a different title.")

        return (best_index, best_score)

    def recommend(self, title):
        data = self.conn.execute('select * from movies')
        data_list = data.fetchall()
        title_plots = [(row[1], row[2]) for row in data_list]

        plot = self.__fetch_plot(title)
        if plot is None:
            raise Exception("We can't get the movie data. Please try a different title.")
        
        title_plots.append((title, plot))

        vectorizer = TfidfVectorizer()
        vecs = vectorizer.fit_transform([self.__format_plot(p[1]) for p in title_plots])

        (best_index, best_score) = self.__find_most_similar(vecs)
        return title_plots[best_index]

with open('apikey.txt', 'r') as file:
    # strip() drops a trailing newline in the key file
    engine = recommend_engine('recommend.db', file.read().strip())
    title = input('Enter the title of your favorite movie in English: ')

    try:
        (recommended_title, recommended_plot) = engine.recommend(title)
        print()
        print('If you like ' + title + ', here is our recommendation!')
        print('Title: ' + recommended_title)
        print('Plot: ' + recommended_plot)
    except Exception as e:
        print(e)

Enter the title of your favorite movie in English: Life Is Beautiful

If you like Life Is Beautiful, here is our recommendation!
Title: Inglourious Basterds
Plot: In German-occupied France, young Jewish refugee Shosanna Dreyfus witnesses the slaughter of her family by Colonel Hans Landa. Narrowly escaping with her life, she plots her revenge several years later when German war hero Fredrick Zoller takes a rapid interest in her and arranges an illustrious movie premiere at the theater she now runs. With the promise of every major Nazi officer in attendance, the event catches the attention of the "Basterds", a group of Jewish-American guerrilla soldiers led by the ruthless Lt. Aldo Raine. As the relentless executioners advance and the conspiring young girls plans are set in motion, their paths will cross for a fateful evening that will shake the very annals of history.
