# DAML Project - Movie Directors

The IMDB website makes a subset of its movie data available in its [IMDB interfaces][imdbif]
section.  This data is a dump of the website's database tables as CSV/TSV files, and it is
licensed for personal and non-commercial use.  It is quite a big lump of data and therefore
a good example of real-world data exploration and machine learning application.  Let's try to
do something useful with this dataset.

[imdbif]: http://www.imdb.com/interfaces/ "interfaces section"

## The Data

Files are available on the [IMDB datasets][dsimdb] page and come compressed with GNU-zip (`gzip`).
`pandas`' `read_csv` can infer the compression from the file extension;
otherwise you can pass it the `compression='gzip'` argument.  To see the content of the files
without `pandas`: on Linux/Mac, running `gunzip` on the file should uncompress it; on Microsoft
Windows, most free unzipping programs have a `gzip` option.

[dsimdb]: https://datasets.imdbws.com/ "datasets"
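
As a minimal sketch of the loading step described above, the snippet below writes a tiny
stand-in file and reads it back with `read_csv`.  The file name, the column names and the
`\N` missing-value marker are assumptions modelled on the IMDB dumps; check them against
the files you actually download.

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a tiny gzipped TSV so the example runs offline.
# File name and columns are assumptions mimicking an IMDB dump.
sample = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\n"
    "tt0000001\tmovie\tExample One\t1994\n"
    "tt0000002\tshort\tExample Two\t\\N\n"
)
path = os.path.join(tempfile.mkdtemp(), "title.basics.tsv.gz")
with gzip.open(path, "wt", encoding="utf-8") as handle:
    handle.write(sample)

# The compression would be inferred from the .gz suffix; it is passed
# explicitly here only to show the argument.  "\N" is treated as missing.
titles = pd.read_csv(path, sep="\t", na_values="\\N", compression="gzip")
print(titles)
```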

It is important to remember that the data comes from SQL tables.  Therefore, the two most
important columns are `tconst` (for movies) and `nconst` (for people).  These are
the primary keys, and you will need to join the data over them.  Not all files from
the database will be needed; deciding which files are needed is part
of the problem (but see the hints below).
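
A join over the two keys might look like the sketch below.  The three small frames are
hypothetical stand-ins for the real tables, and the assumption that directors are stored
as a comma-separated list of `nconst` values should be verified against the actual files.

```python
import pandas as pd

# Hypothetical miniature tables; the real ones come from the dump files.
basics = pd.DataFrame({"tconst": ["tt1", "tt2"], "primaryTitle": ["A", "B"]})
crew = pd.DataFrame({"tconst": ["tt1", "tt2"], "directors": ["nm1", "nm1,nm2"]})
names = pd.DataFrame({"nconst": ["nm1", "nm2"], "primaryName": ["Dir One", "Dir Two"]})

# Split the comma-separated ids into one row per (movie, director) pair.
crew_long = crew.assign(nconst=crew["directors"].str.split(",")).explode("nconst")

# Join movies to directors over tconst, then directors to names over nconst.
movies = (
    basics.merge(crew_long[["tconst", "nconst"]], on="tconst", how="inner")
    .merge(names, on="nconst", how="left")
)
print(movies)
```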

Also note that the data is dirty to some extent.  We can find absurd movies in the dataset,
e.g. movies in the future or without a cast.  Be careful and clean it up.  Also note that
IMDB is a list of movies *and* series, and it stores the series as episodes.  Be careful
either to leave the series out or to avoid counting every episode as if it were a full movie.
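
The cleaning rules above can be sketched as boolean filters.  The `titleType` and
`startYear` column names and the `"movie"` label are assumptions about the dump layout,
and `cast_counts` is a hypothetical per-movie count you would compute from the cast table.

```python
from datetime import date

import pandas as pd

# Hypothetical sample: one good movie, one episode, one future movie,
# and one movie without any cast.
titles = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "titleType": ["movie", "tvEpisode", "movie", "movie"],
    "startYear": [1994, 2005, 2100, 2010],
})
cast_counts = pd.Series({"tt1": 5, "tt2": 3, "tt3": 4})  # tt4 is absent: no cast

is_movie = titles["titleType"] == "movie"              # drop series episodes
not_future = titles["startYear"] <= date.today().year  # drop movies in the future
has_cast = titles["tconst"].map(cast_counts).fillna(0) > 0

clean = titles[is_movie & not_future & has_cast]
print(clean)
```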

## The Problem

We will divide the problem into three levels.  It is advisable to solve the first level;
the second level is meant for everyone who enjoys an extra challenge; and the third level is
even more challenging and not necessarily viable on a first or second attempt (read: I have not
tried it yet, but it should be possible).  Anyhow, on to the problem.

Most movie directors have a preferred group of actors with whom they like to work.
Moreover, they work in specific genres of movies, or make series or short films instead of feature films.
We will attempt to create a **movie director classifier**, which will tell us the director's
name based only on the data that is available on IMDB.  In other words, given:

- title
- year
- running time
- genre(s)
- cast
- writers

of a movie, our classifier will tell us who directed it.
How well it needs to perform depends on which level of the problem you are trying to solve.

Note:  A classifier counts as built once it has been cross-validated across a grid of possible
hyperparameters and achieves a decent score (e.g. F1).
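
As an illustration of what "cross-validated across a grid" could mean in practice, here is
a minimal sketch using scikit-learn's `GridSearchCV` with F1 scoring.  The synthetic data,
the logistic regression model and the grid over `C` are all assumptions chosen for brevity,
not a recommendation for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the real feature matrix and director labels.
X, y = make_classification(n_samples=80, n_features=20, n_informative=5, random_state=0)

# Hypothetical grid: only the regularisation strength is searched here.
grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=grid,
    scoring="f1",  # score each fold with F1 instead of plain accuracy
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```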

### Level 1

Given the data about a movie, the classifier must be able to tell whether the movie was
directed by [Quentin Tarantino][quentin] or not.

[quentin]: https://en.wikipedia.org/wiki/Quentin_Tarantino "wikipedia"

### Level 2

Build a list of directors that have made at least 7 movies and create a classifier
to distinguish between these movie directors.
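
Building the list of directors is a group-and-count over the joined data.  In this sketch
`credits` is a hypothetical frame with one row per (movie, director) pair.

```python
import pandas as pd

MIN_MOVIES = 7

# Hypothetical credits: nm1 directed eight movies, nm2 only three.
credits = pd.DataFrame({
    "tconst": [f"tt{i}" for i in range(11)],
    "nconst": ["nm1"] * 8 + ["nm2"] * 3,
})

# Count distinct movies per director and keep those at or above the threshold.
counts = credits.groupby("nconst")["tconst"].nunique()
prolific = counts[counts >= MIN_MOVIES].index

# Restrict the data to movies by the selected directors.
subset = credits[credits["nconst"].isin(prolific)]
print(list(prolific), len(subset))
```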

### Level 3

Now that we have a classifier for directors, let's try to find the centre of each class
in the high-dimensional space and define relative distances.  For example, this would give
us a way to tell whether [Roman Polanski][polanski]'s work is closer to Quentin Tarantino's
or [Steven Spielberg][spielberg]'s.

[polanski]: https://en.wikipedia.org/wiki/Roman_Polanski "wikipedia"
[spielberg]: https://en.wikipedia.org/wiki/Steven_Spielberg "wikipedia"
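
One possible reading of "centre of each class" is the mean feature vector of a director's
movies.  The sketch below takes that reading and compares centres with cosine distance;
both choices are assumptions, and the tiny feature matrices and director labels are
hypothetical.

```python
import numpy as np

# Hypothetical binary feature rows (one row per movie) for three directors.
features = {
    "director_a": np.array([[1, 1, 0, 0], [1, 0, 0, 0]], dtype=float),
    "director_b": np.array([[1, 1, 0, 0], [1, 1, 1, 0]], dtype=float),
    "director_c": np.array([[0, 0, 1, 1], [0, 0, 0, 1]], dtype=float),
}

# The centre of a class is taken to be the mean of its rows.
centroids = {name: rows.mean(axis=0) for name, rows in features.items()}


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    return 1.0 - float(u @ v) / float(np.linalg.norm(u) * np.linalg.norm(v))


d_ab = cosine_distance(centroids["director_a"], centroids["director_b"])
d_ac = cosine_distance(centroids["director_a"], centroids["director_c"])
print("a is closer to", "b" if d_ab < d_ac else "c")
```

Euclidean distance would work with the same structure; which metric is meaningful for
sparse 0/1 features is part of what this level asks you to explore.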

Finally, around the classes of directors with at least 7 movies, cluster the directors who
have fewer movies but make similar films (where similar means: according to our data).
This tries to answer which aspiring new directors may become a big name in a specific genre,
the genre being that of the big director they are currently similar to.
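
A simple way to approach this is a nearest-centre assignment: attach each less prolific
director to the closest established director.  The sketch below does this with
scikit-learn's `pairwise_distances`; the vectors, the names and the cosine metric are
assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical class centres of two established directors.
big_names = ["director_a", "director_c"]
big_centroids = np.array([[1.0, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 1.0]])

# Hypothetical mean feature vectors of directors with fewer movies.
small_names = ["newcomer_x", "newcomer_y", "newcomer_z"]
small_profiles = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.9, 0.0, 0.1, 0.0],
])

# Rows are newcomers, columns are established directors.
dist = pairwise_distances(small_profiles, big_centroids, metric="cosine")

# Each newcomer joins the cluster of the nearest established director.
assignment = {s: big_names[i] for s, i in zip(small_names, dist.argmin(axis=1))}
print(assignment)
```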

## Hints

-   Try different classifiers!  Start with simple ones and advance to more complex ones.
    This is a more-or-less real-world and difficult problem; do not expect the accuracy
    of your classifier to be in the 90% range.  More than 80% accuracy is good enough.

-   This project is a case of very heterogeneous features across a lot of data
    but not much data that can be used to train a classifier.  This scenario
    is common in the real world.

-   Most of the data is categorical in nature (names of people), so you will be working
    with sparse matrices of zeros and ones most of the time.  Transform your data into
    these matrices after filtering out the data you do not need.  Changing every actor's
    name into a dummy-variable column is likely to exhaust the memory of even the most
    capable machines.

-   Remember that the data is very, very high-dimensional.  And, due to the large number
    of categorical variables, most of the decision boundaries are aligned to the axes
    in some dimension.

-   Many directors also write their own movies, while others despise the idea.
    The `writers` feature will be very important.

-   Quentin Tarantino is known for working with a very small and selected group of actors.

-   A classifier trains badly on classes of very different sizes, e.g. one class
    with 10 samples and another with 10 000 samples.  When building the training set,
    try to even out the class sizes.  This means that for the level 1 classifier
    you will probably get the best results with a total training set of fewer than
    100 points.

-   With datasets that are quite small (as most training sets will be here) the
    cross-validation technique should be leave-one-out.

-   When building a list of directors you may want to fine-tune it by hand.
    It is completely fine to limit your classifier to Hollywood-only movies.
    IMDB contains movies from several countries, which can lead the classifier
    to learn country boundaries by picking up on cast names that
    are known in one country only.  This may make the classifier worse
    at distinguishing between directors in the same country.

-   When building the clusters it may be wise to cluster only directors with at least 2 movies
    to their name.  This allows you to try more expensive techniques and makes it easier
    to avoid dirty data.
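
Several of the hints above (filter first, encode into a sparse 0/1 matrix, even out the
classes, use leave-one-out) can be combined as in the following sketch.  The toy movies,
the undersampling strategy and the logistic regression model are assumptions for
illustration; the real pipeline would start from the joined IMDB tables.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.preprocessing import MultiLabelBinarizer

rng = np.random.default_rng(0)

# Hypothetical toy data: each movie is the set of nconst ids credited on it.
target_movies = [
    {"nm1", "nm2", "nm3"}, {"nm1", "nm2", "nm4"},
    {"nm1", "nm3", "nm5"}, {"nm1", "nm2", "nm5"},
]
other_movies = [{f"nm{10 + i}", f"nm{11 + i}", f"nm{12 + i}"} for i in range(200)]

# 1. Even out the classes by undersampling the large "other" class.
keep = rng.choice(len(other_movies), size=len(target_movies), replace=False)
sampled_others = [other_movies[i] for i in keep]

# 2. Encode only after filtering, so columns exist just for people who
#    appear in the retained movies; sparse_output keeps the matrix sparse.
movies = target_movies + sampled_others
y = np.array([1] * len(target_movies) + [0] * len(sampled_others))
X = MultiLabelBinarizer(sparse_output=True).fit_transform(movies)

# 3. Leave-one-out: predict each movie from a model trained on all the
#    others, then score the pooled predictions.
pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
score = f1_score(y, pred)
print(X.shape, round(score, 3))
```

Scoring the pooled predictions avoids computing F1 on single-sample folds, where the
metric is not meaningful.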