# Project Instructions

Your project is to create a module named `moviedb` for managing a movie collection. This project is to be done by SLT. Commit your code to a private GitHub repo named `moviedb`, then submit this notebook by 16 June at 8pm. Only one member of the SLT should submit; there will be a penalty for submissions from multiple members of an SLT. The module should be in the top-level directory of the repo. Grant read access to the repo to Christian Alis (GitHub account `ianalis`). **Do not write your code in this notebook or submit it along with this notebook.** Just specify your repo URL in the cell below.

Only the following packages may be used for the implementation:

* Python standard libraries
* Numpy (but not scipy)
* Pandas

## The `MovieDB` class

The module should contain one class named `MovieDB`. It should have the following specifications:

### Initialization

The class initializer should accept a string `data_dir`, the path of the directory where the data files are located. This path should be stored in the `data_dir` attribute of the object.
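
As a rough illustration of the requirement above, the initializer only needs to remember the path it was given. This is a minimal sketch, not the expected implementation; everything beyond the `data_dir` attribute is left out.

```python
class MovieDB:
    """Manage a movie collection stored as CSV files (sketch only)."""

    def __init__(self, data_dir):
        # Store the path exactly as passed so that `data_dir` round-trips.
        self.data_dir = data_dir


movie_db = MovieDB('.')
print(movie_db.data_dir)
```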

### Data persistence

Data are stored in the CSV files `movies.csv` and `directors.csv` in `data_dir`. The CSV files begin with a column header and the columns of each CSV file are:

`movies.csv`:
  - `movie_id`: an integer assigned to the movie
  - `title`: movie title. It is enclosed in double quotes (") when stored in the CSV file if there is a comma (,) in the title
  - `year`: release year
  - `genre`: movie genre
  - `director_id`: id of the movie's director
  
`directors.csv`:
 - `director_id`: an integer assigned to the director
 - `given_name`: given name of the director
 - `last_name`: last name of the director
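
The quoting rule for `title` matches the default behavior of the standard `csv` module, which quotes a field only when it contains the delimiter or another special character. The sketch below writes to an in-memory buffer to show this; the second movie row and its ids are illustrative data, not part of the test files.

```python
import csv
import io

# Write a header and two rows to an in-memory buffer instead of a file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['movie_id', 'title', 'year', 'genre', 'director_id'])
writer.writerow([1, 'Shrek', 2001, 'Comedy', 1])
writer.writerow([2, 'Lock, Stock and Two Smoking Barrels', 1998, 'Comedy', 2])
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back restores the title without the quotes.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[1]['title'])
```

Only the title containing a comma is written with double quotes; the plain title is left as-is.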

### Exception

The class has the associated exception `MovieDBError` which is a `ValueError`.
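
One way to read "is a `ValueError`" is that `MovieDBError` subclasses `ValueError`, so code that catches `ValueError` also catches it. A minimal sketch under that reading (the error message is made up for illustration):

```python
class MovieDBError(ValueError):
    """Raised for invalid MovieDB operations (sketch only)."""


try:
    raise MovieDBError('movie is already in the database')
except ValueError as err:
    # A handler for ValueError also catches the subclass.
    caught = err

print(type(caught).__name__)
```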

### Features

#### Adding a movie to the database

Create a method `add_movie` that accepts the following parameters:
  - `title`: title of the movie
  - `year`: year movie was released as integer
  - `genre`: genre of the movie
  - `director`: director of the movie as a string in `Last name, Given name` format
  
  The method should append the movie to the end of `movies.csv` if the file exists, or create the file otherwise. The `movie_id` is _last `movie_id` in the file_ + 1, or `1` if there is no movie in the file yet. The `director_id` is the corresponding `director_id` in `directors.csv`, found by case-insensitive matching of `given_name` and `last_name`. If the director is not yet in `directors.csv`, the method should append the director there. The `director_id` of a new director is _last `director_id` in the file_ + 1, or `1` if there is no director in the file yet. The method should return the `movie_id`, or raise a `MovieDBError` exception if the movie is already in `movies.csv`. A movie is considered to be in `movies.csv` if an existing movie matches the `title` (case-insensitive), `year`, `genre` (case-insensitive) and `director` (case-insensitive).
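
  The id and matching rules above can be sketched with a few helpers. The helper names (`next_id`, `split_director`, `find_director_id`) are hypothetical, and rows are assumed to be dictionaries as returned by `csv.DictReader`. The whitespace normalization is an inference from the visible tests, which pass values with stray spaces; it is not stated in the specification.

```python
def next_id(rows, id_field):
    """Return the last id in `rows` plus one, or 1 when `rows` is empty."""
    return int(rows[-1][id_field]) + 1 if rows else 1


def split_director(director):
    """Split 'Last name, Given name' into whitespace-normalized parts."""
    last_name, given_name = director.split(',', 1)
    return ' '.join(last_name.split()), ' '.join(given_name.split())


def find_director_id(directors, director):
    """Return the id of a case-insensitive name match, or None."""
    last_name, given_name = split_director(director)
    for row in directors:
        if (row['last_name'].casefold() == last_name.casefold()
                and row['given_name'].casefold() == given_name.casefold()):
            return int(row['director_id'])
    return None


# Illustrative rows, shaped like csv.DictReader output (all strings).
directors = [
    {'director_id': '1', 'given_name': 'The', 'last_name': 'Wachowskis'},
    {'director_id': '2', 'given_name': 'Andrew', 'last_name': 'Adamson'},
]
print(find_director_id(directors, 'adamson, ANDREW'))
print(find_director_id(directors, ' Raimi,  Sam '))
print(next_id(directors, 'director_id'))
```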

#### Adding movies in the database

Create a method `add_movies` that accepts a list of movies in the form of dictionaries with the following keys:
  - `title`: title of the movie
  - `year`: year movie was released as integer
  - `genre`: genre of the movie
  - `director`: director of the movie as a string in `Last name, Given name` format
  
  The method should add each movie to the database. It returns a list of the `movie_id`s of successfully added movies. If a movie is already in the database, skip it and print `Warning: movie {title} is already in the database. Skipping...` instead. If a movie has invalid or incomplete information, skip it and print `Warning: movie index {i} has invalid or incomplete information. Skipping...` instead. The movie index is the zero-based index of the movie in the passed list of movies.
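
  The skip-and-warn behavior can be sketched as a loop over the list. The specification does not define what makes a movie "invalid"; this sketch only treats missing keys (or a non-dictionary entry) as incomplete, which is an assumption. `fake_add_movie` is a hypothetical in-memory stand-in for `add_movie`, used only so the example runs on its own.

```python
class MovieDBError(ValueError):
    """Stand-in for the module's exception."""


REQUIRED_KEYS = ('title', 'year', 'genre', 'director')


def add_movies(movies, add_movie):
    """Add each movie with `add_movie`; return ids of the ones added."""
    added_ids = []
    for i, movie in enumerate(movies):
        try:
            args = [movie[key] for key in REQUIRED_KEYS]
        except (KeyError, TypeError):
            print(f'Warning: movie index {i} has invalid or incomplete '
                  'information. Skipping...')
            continue
        try:
            added_ids.append(add_movie(*args))
        except MovieDBError:
            print(f"Warning: movie {movie['title']} is already in the "
                  'database. Skipping...')
    return added_ids


seen = []


def fake_add_movie(title, year, genre, director):
    """Hypothetical stand-in that keeps movies in a list, not a file."""
    key = (title.casefold(), year, genre.casefold(), director.casefold())
    if key in seen:
        raise MovieDBError('movie is already in the database')
    seen.append(key)
    return len(seen)


added = add_movies([
    {'title': 'The Matrix', 'year': 1999, 'genre': 'Action',
     'director': 'Wachowskis, The'},
    {'title': 'Shrek', 'year': 2001, 'genre': 'Comedy',
     'director': 'Adamson, Andrew'},
    {'title': 'The Matrix Reloaded', 'year': 2003, 'genre': 'Action'},
    {'title': 'Shrek', 'year': 2001, 'genre': 'Comedy',
     'director': 'Adamson, Andrew'},
], fake_add_movie)
print(added)
```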
  
#### Deleting a movie in the database

Create a method `delete_movie` that accepts the `movie_id` of the movie to delete, then removes that movie from `movies.csv`. It should raise a `MovieDBError` if the `movie_id` is not found.
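
A simple way to remove a row is to read the whole file, drop the matching row, and write the file again. The sketch below does this with the standard `csv` module; `delete_row` is a hypothetical helper that reports whether anything was removed, so a real `delete_movie` could raise `MovieDBError` when it returns `False`.

```python
import csv
import os
from tempfile import TemporaryDirectory

FIELDS = ['movie_id', 'title', 'year', 'genre', 'director_id']


def delete_row(csv_path, movie_id):
    """Rewrite `csv_path` without `movie_id`; return True if removed."""
    if not os.path.exists(csv_path):
        return False
    with open(csv_path, newline='') as f:
        rows = list(csv.DictReader(f))
    kept = [row for row in rows if int(row['movie_id']) != movie_id]
    if len(kept) == len(rows):
        return False
    # Recreate the entire file rather than editing one line in place.
    with open(csv_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(kept)
    return True


with TemporaryDirectory() as temp_dir:
    path = os.path.join(temp_dir, 'movies.csv')
    with open(path, 'w', newline='') as f:
        f.write('movie_id,title,year,genre,director_id\n'
                '1,The Matrix,1999,Action,1\n'
                '2,Shrek,2001,Comedy,2\n')
    first = delete_row(path, 1)    # the row exists, so it is removed
    second = delete_row(path, 1)   # the row is already gone
    with open(path, newline='') as f:
        remaining = [row['title'] for row in csv.DictReader(f)]

print(first, second, remaining)
```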
  
#### Searching for movies in the database

Create a method `search_movies` that accepts the following keyword arguments:
  - `title`: case-insensitive title of the movie
  - `year`: year movie was released as integer
  - `genre`: case-insensitive genre of the movie
  - `director_id`: id of the director of the movie

All of the arguments are optional but there should be at least one nontrivial argument passed to the method. It should raise a `MovieDBError` if there is no nontrivial argument that was passed. It should return the list of matching `movie_id`s.
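
The filtering logic can be sketched over rows held in memory. Two assumptions are made here that the specification does not settle: a "nontrivial" argument is taken to mean any argument that is not `None`, and titles and genres are compared for exact equality after case folding. `search_rows` and the sample rows are hypothetical.

```python
class MovieDBError(ValueError):
    """Stand-in for the module's exception."""


def search_rows(rows, title=None, year=None, genre=None, director_id=None):
    """Return the movie ids of rows matching every given criterion."""
    criteria = {'title': title, 'year': year, 'genre': genre,
                'director_id': director_id}
    # Assumption: "nontrivial" means "not None".
    criteria = {k: v for k, v in criteria.items() if v is not None}
    if not criteria:
        raise MovieDBError('at least one search argument is required')

    def matches(row):
        for key, value in criteria.items():
            if key in ('title', 'genre'):
                if row[key].casefold() != value.casefold():
                    return False
            elif int(row[key]) != value:
                return False
        return True

    return [int(row['movie_id']) for row in rows if matches(row)]


# Illustrative rows, shaped like csv.DictReader output (all strings).
rows = [
    {'movie_id': '1', 'title': 'Spider-Man', 'year': '1977',
     'genre': 'Action', 'director_id': '1'},
    {'movie_id': '2', 'title': 'Spider-Man', 'year': '2002',
     'genre': 'Action', 'director_id': '2'},
    {'movie_id': '3', 'title': 'Shrek', 'year': '2001',
     'genre': 'Comedy', 'director_id': '3'},
]
print(search_rows(rows, title='spider-man'))
print(search_rows(rows, year=2002))
print(search_rows(rows, title='SHREK', genre='comedy'))
```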

#### Exporting data

Create a method `export_data` that returns all of the movies in the database as a pandas data frame with the following columns:
  - `title`: title of the movie
  - `year`: year movie was released as integer
  - `genre`: genre of the movie
  - `director_last_name`: last name of the movie director
  - `director_given_name`: given name of the movie director
  
Sort the rows by the corresponding `movie_id` of each movie.
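
With pandas, the export can be sketched as a join of the two tables on `director_id`, followed by a sort and a column selection. The data frames below are small illustrative stand-ins for the contents of `movies.csv` and `directors.csv`; the rows are deliberately out of order to show the sort.

```python
import pandas as pd

movies = pd.DataFrame({
    'movie_id': [2, 1, 3],
    'title': ['Shrek', 'The Matrix', 'The Matrix Reloaded'],
    'year': [2001, 1999, 2003],
    'genre': ['Comedy', 'Action', 'Action'],
    'director_id': [2, 1, 1],
})
directors = pd.DataFrame({
    'director_id': [1, 2],
    'given_name': ['The', 'Andrew'],
    'last_name': ['Wachowskis', 'Adamson'],
})

export_columns = ['title', 'year', 'genre', 'director_last_name',
                  'director_given_name']
exported = (
    movies.merge(directors, on='director_id', how='left')
    .sort_values('movie_id')
    .rename(columns={'last_name': 'director_last_name',
                     'given_name': 'director_given_name'})
    .loc[:, export_columns]
    .reset_index(drop=True)
)
print(exported)
```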
  
#### Generating statistics

Create a method `generate_statistics` that returns a dictionary depending on the `stat` parameter passed to it:
  - `movie`: key is year, value is the number of movies for that year
  - `genre`: key is each unique genre, value is another dictionary with year as key and number of movies of that genre for that year as value
  - `director`: key is director name following the format `Last name, Given name`, value is another dictionary with year as key and number of movies of that director for that year as value
  - `all`: keys are `movie`, `genre` and `director`, values are the corresponding dictionary returned by those keywords

The `stat` values are case-sensitive and the method should raise `MovieDBError` if the passed `stat` is unknown.
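
The dictionary shapes described above can be illustrated with `collections.Counter`. The helper `count_by_year` and the sample records are hypothetical; each record is assumed to already carry the director name in `Last name, Given name` format.

```python
from collections import Counter, defaultdict


def count_by_year(records, key=None):
    """Count records per year, optionally nested under `key`."""
    if key is None:
        return dict(Counter(r['year'] for r in records))
    nested = defaultdict(Counter)
    for r in records:
        nested[r[key]][r['year']] += 1
    # Convert to plain dictionaries so the result compares equal to dicts.
    return {name: dict(counts) for name, counts in nested.items()}


records = [
    {'year': 1999, 'genre': 'Action', 'director': 'Wachowskis, The'},
    {'year': 2001, 'genre': 'Comedy', 'director': 'Adamson, Andrew'},
    {'year': 2003, 'genre': 'Action', 'director': 'Wachowskis, The'},
    {'year': 2003, 'genre': 'Action', 'director': 'Wachowskis, The'},
]
stats = {
    'movie': count_by_year(records),
    'genre': count_by_year(records, 'genre'),
    'director': count_by_year(records, 'director'),
}
print(stats)
```

The `stats` dictionary here has the same shape as the result described for `all`.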

#### Plotting statistics

Create a method `plot_statistics` that returns a matplotlib `Axes` depending on the `stat` parameter passed to it:
  - `movie`: bar plot of number of movies per year
  - `genre`: superimposed circle and line plots of the number of movies for each genre per year, one line per genre. Sort alphabetically by genre then show the legend.
  - `director`: superimposed circle and line plots of the number of movies for each director per year, one director per line. Sort the directors by decreasing number of movies then by increasing last name. Plot only the 5 directors with the most movies.
  
The `stat` values are case-sensitive and the method should raise `MovieDBError` if the passed `stat` is unknown.
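
The director ordering can be sketched separately from the plotting itself. The helper `top_directors` is hypothetical and takes a dictionary shaped like the `director` statistics; the counts below are made up for illustration. Drawing the lines with matplotlib is left out of this sketch.

```python
def top_directors(director_stats, n=5):
    """Rank by decreasing movie count, then by increasing last name."""
    totals = {name: sum(per_year.values())
              for name, per_year in director_stats.items()}
    ranked = sorted(
        totals,
        key=lambda name: (-totals[name], name.split(',')[0]),
    )
    return ranked[:n]


# Illustrative counts only, not taken from the test data.
director_stats = {
    'Spielberg, Steven': {1993: 2, 1998: 1},
    'Allen, Woody': {1986: 1, 1987: 2, 1989: 1},
    'Eastwood, Clint': {1992: 1, 2003: 1, 2004: 2},
    'Soderbergh, Steven': {2000: 2, 2001: 1},
    'Raimi, Sam': {2002: 1},
    'Adamson, Andrew': {2001: 1},
    'Scott, Ridley': {2000: 1, 2001: 2},
}
print(top_directors(director_stats))
```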
  
#### Token frequency

Create a method `token_freq` that returns a dictionary with the token as key and the number of times that token appeared in all of the titles as value. A token is defined as a case-folded sequence of non-whitespace characters.
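
Under this definition, tokenizing amounts to case-folding a title and splitting it on whitespace. A minimal sketch that works on a list of titles rather than on `movies.csv` (the function signature here is an assumption for illustration; the real method takes no titles argument):

```python
from collections import Counter


def token_freq(titles):
    """Count case-folded, whitespace-separated tokens across titles."""
    counts = Counter()
    for title in titles:
        # str.split() with no arguments splits on any run of whitespace.
        counts.update(title.casefold().split())
    return dict(counts)


freq = token_freq(['The Matrix', 'The Matrix Reloaded', 'Spider-Man'])
print(freq)
```

Punctuation stays attached to its token, so `spider-man` is counted as a single token.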


## Grading guide

* The project has a highest possible score of 150 points.

* Each cell with an assert statement is worth 10 pts. Passing all of the tests in a cell earns the entire 10 pts. Failing any test in the cell, including hidden tests, earns no points. No partial points will be given, so make sure that you run and pass all the visible tests in the test suite before submitting.

* Successful git cloning is worth 15 pts. Successful importing of the module is worth 5 pts. 

* If the module fails to clone or import, the professor will attempt to make it work, but this will incur additional deductions of up to 10% of the highest possible score.

* Methods should have a sensible docstring. The professor will deduct up to a total of 15 pts for missing, misleading or nonsensible docstrings. If you reasonably follow the numpy docstring format then you will likely not receive any deductions.

* The code should follow PEP8. The professor will run your Python code through [pycodestyle](https://pypi.org/project/pycodestyle/) and will deduct one point for every PEP8 violation (including warnings), up to a total of 15 points.

**Hints**: 
* Instead of figuring out how to modify a specific line in the CSV files, recreate the entire CSV file then overwrite the original file.
* Use `os.path` for path operations.

In [None]:
# THIS IS THE ONLY CELL THAT YOU WILL MODIFY IN THIS NOTEBOOK.
# Store the SSH clone URL for your `moviedb` repo as a string
git_repo_url = ''
# YOUR CODE HERE
raise NotImplementedError()

# Automated tests

## Cloning

In [None]:
import pickle
import shutil
import os
import pandas as pd
from tempfile import TemporaryDirectory
from numpy.testing import assert_equal, assert_almost_equal, assert_raises

In [None]:
#### Test clone and pip install
# The tests here will run `git clone {git_repo_url} moviedb` then copy all
# of the repo contents back to the directory where this notebook is

## Code style

PEP 8 violations and warnings:

In [2]:
!find . -iname "*.py" | xargs pycodestyle | wc -l


## The `MovieDB` class

In [1]:
from moviedb import MovieDB, MovieDBError

### Initialization

In [None]:
movie_db = MovieDB('.')
assert_equal(movie_db.data_dir, '.')


### Adding a movie to the database

In [None]:
with TemporaryDirectory() as temp_dir:
    movie_db = MovieDB(temp_dir)
    assert_equal(
        movie_db.add_movie('Shrek', 2001, 'Comedy', 'Adamson, Andrew'), 
        1
    )
    assert_raises(
        MovieDBError,
        lambda: movie_db.add_movie('Shrek', 2001, 'Comedy', 'Adamson, Andrew')
    )
    df_movies = pd.read_csv(os.path.join(temp_dir, 'movies.csv'))
    assert_equal(
        df_movies.columns.tolist(),
        ['movie_id', 'title', 'year', 'genre', 'director_id']
    )
    assert_equal(
        df_movies.values.tolist(), 
        [[1, 'Shrek', 2001, 'Comedy', 1]]
    )
    assert_equal(
        df_movies.index.tolist(), 
        [0]
    )
    df_directors = pd.read_csv(os.path.join(temp_dir, 'directors.csv'))
    assert_equal(
        df_directors.columns.tolist(),
        ['director_id', 'given_name', 'last_name']
    )
    assert_equal(
        df_directors.values.tolist(), 
        [[1, 'Andrew', 'Adamson']]
    )
    assert_equal(
        df_directors.index.tolist(), 
        [0]
    )

In [None]:
with TemporaryDirectory() as temp_dir:
    movie_db = MovieDB(temp_dir)
    assert_equal(
        movie_db.add_movie('The Matrix', 1999, 'Action', 'Wachowskis, The'), 
        1
    )
    assert_equal(
        movie_db.add_movie('Shrek', 2001, 'Comedy', 'Adamson, Andrew'), 
        2
    )
    assert_equal(
        movie_db.add_movie('The Matrix Reloaded', 2003, 'Action', 
                           'Wachowskis, The'), 
        3
    )
    assert_equal(
        movie_db.add_movie('The Matrix Revolutions', 2003, 'Action', 
                           'Wachowskis, The'), 
        4
    )
    assert_raises(
        MovieDBError,
        lambda: movie_db.add_movie('Shrek', 2001, 'Comedy', 'Adamson, Andrew')
    )

In [None]:
with TemporaryDirectory() as temp_dir:
    movie_db = MovieDB(temp_dir)
    assert_equal(
        movie_db.add_movie('Spider-Man', 1977, 'Action', 
                           'Swackhamer, Egbert Warnderink'), 
        1
    )
    assert_equal(
        movie_db.add_movie('Spider-Man', 2002, 'Action', 'Raimi, Sam'),
        2
    )
    assert_equal(
        movie_db.add_movie('Spider-Man', 2002, 'Animation', 'Raimi, Sam'),
        3
    )
    assert_raises(
        MovieDBError,
        lambda: movie_db.add_movie('Spider-Man ', 2002, ' Action', 
                                   ' Raimi,  Sam ')
    )

### Adding movies in the database

In [None]:
%%capture out
with TemporaryDirectory() as temp_dir:
    movie_db = MovieDB(temp_dir)
    movie_db.add_movies([
        {'title': 'The Matrix', 'year': 1999, 'genre': 'Action', 
         'director': 'Wachowskis, The'}, 
        {'title': 'Shrek',  'genre': 'Comedy', 'director': 'Adamson, Andrew', 
         'year': 2001},
        {'year': 2003, 'genre': 'Action', 'title': 'The Matrix Reloaded'},
        {'title': 'Shrek', 'year': 2001, 'genre': 'Comedy', 
         'director': 'Adamson, Andrew'},
        {'title': 'The Matrix Revolutions','year': 2003, 'genre': 'Action', 
         'director': 'Wachowskis, The'}
    ])

In [None]:
assert_equal(
    out.stdout,
    'Warning: movie index 2 has invalid or incomplete information. '
    'Skipping...\n'
    'Warning: movie Shrek is already in the database. Skipping...\n'
)

### Deleting a movie in the database

In [None]:
with TemporaryDirectory() as temp_dir:
    movie_db = MovieDB(temp_dir)
    assert_raises(MovieDBError, lambda: movie_db.delete_movie(1))
    shutil.copy('movies_test.csv', os.path.join(temp_dir, 'movies.csv'))
    shutil.copy('directors_test.csv', os.path.join(temp_dir, 'directors.csv'))
    movie_db.delete_movie(1)
    assert_raises(MovieDBError, lambda: movie_db.delete_movie(1))

### Searching for movies in the database

In [None]:
with TemporaryDirectory() as temp_dir:
    movie_db = MovieDB(temp_dir)
    assert_raises(MovieDBError, lambda: movie_db.search_movies())
    assert_equal(movie_db.search_movies(title='Spider-man'), [])
    shutil.copy('movies_test.csv', os.path.join(temp_dir, 'movies.csv'))
    shutil.copy('directors_test.csv', os.path.join(temp_dir, 'directors.csv'))
    assert_raises(MovieDBError, lambda: movie_db.search_movies())
    assert_equal(movie_db.search_movies(title='Spider-man'), [5, 6, 7])
    assert_equal(movie_db.search_movies(year=2002), [6, 7])

### Exporting data

In [None]:
with TemporaryDirectory() as temp_dir:
    movie_db = MovieDB(temp_dir)
    df_export = movie_db.export_data()
    assert_equal(isinstance(df_export, pd.DataFrame), True)
    assert_equal(
        df_export.columns.tolist(), 
        ['title', 'year', 'genre', 'director_last_name', 
         'director_given_name']
    )
    assert_equal(len(df_export), 0)
    shutil.copy('movies_test.csv', os.path.join(temp_dir, 'movies.csv'))
    shutil.copy('directors_test.csv', os.path.join(temp_dir, 'directors.csv'))
    df_export = movie_db.export_data()
    assert_equal(isinstance(df_export, pd.DataFrame), True)
    assert_equal(
        df_export.columns.tolist(), 
        ['title', 'year', 'genre', 'director_last_name', 
         'director_given_name']
    )
    assert_equal(
        df_export.iloc[:3].values.tolist(),
        [['The Matrix', 1999, 'Action', 'Wachowskis', 'The'],
         ['Shrek', 2001, 'Comedy', 'Adamson', 'Andrew'],
         ['The Matrix Reloaded', 2003, 'Action', 'Wachowskis', 'The']]
    )

### Generating statistics

In [None]:
with TemporaryDirectory() as temp_dir:
    movie_db = MovieDB(temp_dir)
    assert_raises(MovieDBError, lambda: movie_db.generate_statistics('Movie'))
    assert_equal(movie_db.generate_statistics('movie'), {})
    assert_equal(movie_db.generate_statistics('genre'), {})
    assert_equal(movie_db.generate_statistics('director'), {})
    assert_equal(
        movie_db.generate_statistics('all'), 
        {'movie': {}, 'genre': {}, 'director': {}}
    )
    shutil.copy('movies_test2.csv', os.path.join(temp_dir, 'movies.csv'))
    shutil.copy('directors_test2.csv', 
                os.path.join(temp_dir, 'directors.csv'))
    stats_movie = movie_db.generate_statistics('movie')
    assert_equal(len(stats_movie), 31)
    assert_equal(stats_movie[2007], 220)
    assert_equal(stats_movie[2016], 220)

In [None]:
with TemporaryDirectory() as temp_dir:
    movie_db = MovieDB(temp_dir)
    shutil.copy('movies_test2.csv', os.path.join(temp_dir, 'movies.csv'))
    shutil.copy('directors_test2.csv', 
                os.path.join(temp_dir, 'directors.csv'))
    stats_genre = movie_db.generate_statistics('genre')
    assert_equal(len(stats_genre), 17)
    assert_equal(len(stats_genre['Adventure']), 31)
    assert_equal(
        stats_genre['Adventure'],
        {2009: 19, 2005: 19, 1995: 19, 1986: 16, 1994: 16, 2016: 16, 2012: 15,
         2007: 15, 1989: 15, 1992: 14, 2013: 14, 2010: 14, 2006: 14, 2014: 13,
         1993: 12, 2011: 12, 2008: 12, 2001: 12, 2000: 12, 2003: 11, 1990: 11,
         2015: 11, 1997: 10, 1996: 10, 2002: 9, 1987: 9, 1988: 9, 1991: 9,
         1999: 8, 1998: 8, 2004: 8}
    )
    stats_director = movie_db.generate_statistics('director')
    assert_equal(len(stats_director), 2753)
    assert_equal(
        stats_director['Spielberg, Steven'],
        {1989: 2, 1993: 2, 1997: 2, 2002: 2, 2005: 2, 2011: 2, 2016: 1,
         1987: 1, 1991: 1, 1998: 1, 2001: 1, 2004: 1, 2008: 1, 2012: 1,
         2015: 1}
    )
    stats_all = movie_db.generate_statistics('all')
    assert_equal(set(stats_all.keys()), {'movie', 'genre', 'director'})
    assert_equal(len(stats_all['movie']), 31)
    assert_equal(stats_all['movie'][2007], 220)
    assert_equal(stats_all['movie'][2016], 220)
    assert_equal(len(stats_all['genre']), 17)
    assert_equal(len(stats_all['genre']['Adventure']), 31)
    assert_equal(
        stats_all['genre']['Adventure'],
        {2009: 19, 2005: 19, 1995: 19, 1986: 16, 1994: 16, 2016: 16, 2012: 15,
         2007: 15, 1989: 15, 1992: 14, 2013: 14, 2010: 14, 2006: 14, 2014: 13,
         1993: 12, 2011: 12, 2008: 12, 2001: 12, 2000: 12, 2003: 11, 1990: 11,
         2015: 11, 1997: 10, 1996: 10, 2002: 9, 1987: 9, 1988: 9, 1991: 9,
         1999: 8, 1998: 8, 2004: 8}
    )
    assert_equal(len(stats_all['director']), 2753)
    assert_equal(
        stats_all['director']['Spielberg, Steven'],
        {1989: 2, 1993: 2, 1997: 2, 2002: 2, 2005: 2, 2011: 2, 2016: 1,
         1987: 1, 1991: 1, 1998: 1, 2001: 1, 2004: 1, 2008: 1, 2012: 1,
         2015: 1}
    )

### Plotting statistics

In [None]:
with TemporaryDirectory() as temp_dir:
    movie_db = MovieDB(temp_dir)
    shutil.copy('movies_test2.csv', os.path.join(temp_dir, 'movies.csv'))
    shutil.copy('directors_test2.csv', 
                os.path.join(temp_dir, 'directors.csv'))
    ax_movie = movie_db.plot_statistics('movie')
    ax_movie.get_figure().canvas.draw()
    assert_equal(
        [t.get_text() for t in ax_movie.get_xticklabels()],
        ['1980', '1985', '1990', '1995', '2000', 
         '2005', '2010', '2015', '2020']
    )
    assert_equal(
        [t.get_text() for t in ax_movie.get_yticklabels()],
        ['0', '50', '100', '150', '200', '250']
    )
    assert_equal(ax_movie.get_ylabel(), 'movies')
    assert_equal(
        [h.get_height() for h in ax_movie.patches],
        [219, 219, 219, 220, 219, 220, 220, 220, 220, 220, 220, 219, 220, 220,
         219, 220, 220, 219, 219, 220, 219, 220, 220, 219, 220, 219, 218, 220,
         219, 220, 220]
    )
    ax_genre = movie_db.plot_statistics('genre')
    ax_genre.get_figure().canvas.draw()
    assert_equal(
        [t.get_text() for t in ax_genre.get_xticklabels()],
        ['1980', '1985', '1990', '1995', '2000', 
         '2005', '2010', '2015', '2020']
    )
    assert_equal(
        [t.get_text() for t in ax_genre.get_yticklabels()],
        ['−20', '0', '20', '40', '60', '80', '100']
    )
    assert_equal(ax_genre.get_ylabel(), 'movies')
    assert_equal(len(ax_genre.lines), 17)
    assert_equal(
        ax_genre.lines[0].get_xdata().tolist(),
        [1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996,
         1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007,
         2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
    )
    assert_equal(
        ax_genre.lines[0].get_ydata().tolist(),
        [49, 47, 35, 46, 46, 42, 37, 34, 33, 42, 35, 54, 39, 30, 33, 34, 42,
         51, 40, 44, 35, 36, 52, 35, 52, 49, 45, 52, 47, 47, 61]
    )
    assert_equal(ax_genre.lines[0].get_marker(), 'o')
    assert_equal(ax_genre.lines[0].get_ls(), '-')
    assert_equal(
        [t.get_text() for t in ax_genre.legend_.get_texts()[:5]],
        ['Action', 'Adventure', 'Animation', 'Biography', 'Comedy']
    )
    ax_director = movie_db.plot_statistics('director')
    ax_director.get_figure().canvas.draw()
    assert_equal(
        [t.get_text() for t in ax_director.get_xticklabels()],
        ['1980', '1985', '1990', '1995', '2000', 
         '2005', '2010', '2015', '2020']
    )
    assert_equal(
        [t.get_text() for t in ax_director.get_yticklabels()],
        ['0.8', '1.0', '1.2', '1.4', '1.6', '1.8', '2.0', '2.2']
    )
    assert_equal(ax_director.get_ylabel(), 'movies')
    assert_equal(len(ax_director.lines), 5)
    assert_equal(
        ax_director.lines[0].get_xdata().tolist(),
        [1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996,
         1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007,
         2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
    )
    assert_equal(
        ax_director.lines[0].get_ydata().tolist(),
        [1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1]
    )
    assert_equal(ax_director.lines[0].get_marker(), 'o')
    assert_equal(ax_director.lines[0].get_ls(), '-')
    assert_equal(
        [t.get_text() for t in ax_director.legend_.get_texts()[:3]],
        ['Allen, Woody', 'Eastwood, Clint', 'Soderbergh, Steven']
    )

### Token frequency

In [None]:
with TemporaryDirectory() as temp_dir:
    movie_db = MovieDB(temp_dir)
    assert_equal(movie_db.token_freq(), {})
    shutil.copy('movies_test2.csv', os.path.join(temp_dir, 'movies.csv'))
    shutil.copy('directors_test2.csv', 
                os.path.join(temp_dir, 'directors.csv'))
    token_freq = movie_db.token_freq()
    assert_equal(len(token_freq), 6534)
    assert_equal(token_freq['the'], 1874)
    assert_equal(token_freq['spider-man'], 5)