# TF-IDF modeling with limited resources

In [None]:
import graphlab

### Loading the data from the IMDB crawler

In [90]:
# load the data from the Imdb Data gathered by the crawler
imdb_frame = graphlab.SFrame('Imdb Data')

The data set contains 9700 entries. Let's confirm that and print the first few:

In [91]:
len(imdb_frame)

9700

In [92]:
print(imdb_frame.head())

+-------------------------------+---------------------------+
|          description          |          director         |
+-------------------------------+---------------------------+
| The lives of two mob hit m... |     Quentin Tarantino     |
| When New York is put under... |         Marc Webb         |
| Two imprisoned men bond ov... |       Frank Darabont      |
| Luke Skywalker joins force... |        George Lucas       |
| Marty McFly, a 17-year-old... |      Robert Zemeckis      |
| Five high school students,... |        John Hughes        |
| In order to save their hom... |       Richard Donner      |
| A young F.B.I. cadet must ... |       Jonathan Demme      |
| During a preview tour, a t... |      Steven Spielberg     |
| Lion cub and future king S... | Roger Allers, Rob Minkoff |
+-------------------------------+---------------------------+
+-------------------------------+--------+-------------------------------+
|             stars             |  year  |              n

Now we can add another column to the frame that holds the word counts for each entry's description:

In [93]:
imdb_frame['word_count'] = graphlab.text_analytics.count_words(imdb_frame['description'])
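
To make the content of the new column concrete, here is a minimal pure-Python sketch of a word count. It assumes the text is lowercased and split on whitespace only, which is what the tokens printed later in this notebook suggest (punctuation stays attached, as in 'prostitutes.'). The function name `count_words_sketch` is hypothetical and is not part of GraphLab; the sample description is the "Ill Manors" entry from this data set.

```python
from collections import Counter


def count_words_sketch(text):
    """Lowercase the text, split it on whitespace and count every token."""
    return dict(Counter(text.lower().split()))


# Description of "Ill Manors", as it appears in the data set.
description = ("The lives of four drug dealers, one user and two prostitutes. "
               "(121 mins.)")
counts = count_words_sketch(description)

# Print the tokens, most frequent first, ties in alphabetical order.
for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(word, count)
```

Every token occurs once in this description, so the dictionary has 13 entries, which agrees with the 13 rows reported for "Ill Manors" further down.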

In [94]:
print(imdb_frame.head())

+-------------------------------+---------------------------+
|          description          |          director         |
+-------------------------------+---------------------------+
| The lives of two mob hit m... |     Quentin Tarantino     |
| When New York is put under... |         Marc Webb         |
| Two imprisoned men bond ov... |       Frank Darabont      |
| Luke Skywalker joins force... |        George Lucas       |
| Marty McFly, a 17-year-old... |      Robert Zemeckis      |
| Five high school students,... |        John Hughes        |
| In order to save their hom... |       Richard Donner      |
| A young F.B.I. cadet must ... |       Jonathan Demme      |
| During a preview tour, a t... |      Steven Spielberg     |
| Lion cub and future king S... | Roger Allers, Rob Minkoff |
+-------------------------------+---------------------------+
+-------------------------------+--------+-------------------------------+
|             stars             |  year  |              n

Let's grab Pulp Fiction and check the word counts in its description:

In [95]:
pulp_fiction = imdb_frame[imdb_frame['name'] == 'Pulp Fiction']

In [99]:
print(pulp_fiction[['word_count']].
      stack('word_count', new_column_name=['word', 'count']).sort('count', ascending=False))

+-------------+-------+
|     word    | count |
+-------------+-------+
|      a      |   3   |
|      of     |   3   |
|     and     |   2   |
| redemption. |   1   |
|     pair    |   1   |
|    diner    |   1   |
|    wife,    |   1   |
|     four    |   1   |
|     hit     |   1   |
|   bandits   |   1   |
+-------------+-------+
[23 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


We see that 'a', 'of' and 'and' are the most frequent words in that short description.
Now let's calculate the **term frequency-inverse document frequency** statistic:

In [101]:
imdb_frame['tf_idf'] = graphlab.text_analytics.tf_idf(imdb_frame['word_count'])

In [102]:
print(imdb_frame.head())

+-------------------------------+---------------------------+
|          description          |          director         |
+-------------------------------+---------------------------+
| The lives of two mob hit m... |     Quentin Tarantino     |
| When New York is put under... |         Marc Webb         |
| Two imprisoned men bond ov... |       Frank Darabont      |
| Luke Skywalker joins force... |        George Lucas       |
| Marty McFly, a 17-year-old... |      Robert Zemeckis      |
| Five high school students,... |        John Hughes        |
| In order to save their hom... |       Richard Donner      |
| A young F.B.I. cadet must ... |       Jonathan Demme      |
| During a preview tour, a t... |      Steven Spielberg     |
| Lion cub and future king S... | Roger Allers, Rob Minkoff |
+-------------------------------+---------------------------+
+-------------------------------+--------+-------------------------------+
|             stars             |  year  |              n

### Comparing the word counts to the tf-idf weights

Now that we have calculated the tf-idf statistic for the data set, we can compare the two columns and see that tf-idf assigns the weights quite reasonably:

In [104]:
pulp_fiction = imdb_frame[imdb_frame['name'] == "Pulp Fiction"]

In [146]:
pulp_fiction[['tf_idf']].stack('tf_idf', new_column_name=['tf_idf', 'count']).\
            sort('count', ascending=False).print_rows(num_rows=23)

+-------------+-----------------+
|    tf_idf   |      count      |
+-------------+-----------------+
|  gangster's |  8.48673398393  |
|   bandits   |  8.08126887582  |
| redemption. |  8.08126887582  |
|    boxer,   |  8.08126887582  |
|    diner    |  7.79358680337  |
|  intertwine |  7.79358680337  |
|     (154    |  6.78198589169  |
|     men,    |   6.6949745147  |
|    tales    |   6.2895094066  |
|   violence  |  6.18414889094  |
|    wife,    |  5.71414526169  |
|     mob     |  5.37321867472  |
|     pair    |  5.17254797926  |
|     hit     |  5.17254797926  |
|     four    |  4.18266889073  |
|    lives    |  3.84716237123  |
|     two     |  2.64900353677  |
|      of     |  1.96713707802  |
|     and     |  1.51751688365  |
|      in     |  0.924572352706 |
|      a      |  0.790125253357 |
|     the     |  0.420840436967 |
|    mins.)   | 0.0275937239639 |
+-------------+-----------------+
[23 rows x 2 columns]



Note that the highest-weighted words in the description are now "gangster's", 'bandits' and 'redemption.' (punctuation stays attached to the tokens). The common words such as 'the', 'a' and 'in' get low weights because they appear in most descriptions in the corpus.
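
The printed weights can be checked by hand. Below is a minimal sketch that assumes the weight of a word is its count in the description multiplied by ln(N / df), where N is the number of descriptions and df is the number of descriptions containing the word. This formula is inferred from the numbers above rather than stated anywhere in the notebook, and `tf_idf_sketch` and `toy_corpus` are hypothetical names used only for illustration.

```python
import math


def tf_idf_sketch(word_counts):
    """word_counts: list of {word: count} dicts, one per document."""
    n_docs = len(word_counts)
    # Document frequency: in how many documents does each word occur?
    doc_freq = {}
    for counts in word_counts:
        for word in counts:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return [
        {word: tf * math.log(float(n_docs) / doc_freq[word])
         for word, tf in counts.items()}
        for counts in word_counts
    ]


# Toy corpus, not taken from the IMDB data.
toy_corpus = [
    {"the": 2, "mob": 1},
    {"the": 1, "diner": 1},
    {"the": 1, "mob": 1, "boxer": 1},
]
weights = tf_idf_sketch(toy_corpus)
print(weights[0])  # 'the' is in every document, so its weight is 0

# With N = 9700, a word found in 2 or 3 descriptions would get:
print(math.log(9700.0 / 2))  # compare with 8.48673398393 for "gangster's"
print(math.log(9700.0 / 3))  # compare with 8.08126887582 for 'bandits'
```

Under this assumption the top weight of 8.4867 corresponds to df = 2. Whether "gangster's" really occurs in two descriptions or the library applies some smoothing cannot be told from the output alone.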

### Building the Nearest Neighbor Model
Now we can build the nearest neighbor model and see whether such short descriptions are enough to retrieve movies similar to a given title:

In [107]:
# defining the model
tf_idf_model = graphlab.nearest_neighbors.create(imdb_frame, features=['tf_idf'], label='name')

Let's look at the closest neighbors of Pulp Fiction and then at the tf-idf weights in those movies:

In [109]:
print(tf_idf_model.query(pulp_fiction))

+-------------+-----------------+----------------+------+
| query_label | reference_label |    distance    | rank |
+-------------+-----------------+----------------+------+
|      0      |   Pulp Fiction  |      0.0       |  1   |
|      0      |    Ill Manors   | 0.758620689655 |  2   |
|      0      |   Another Year  | 0.771428571429 |  3   |
|      0      |  Boiling Point  | 0.777777777778 |  4   |
|      0      |  New Year's Eve | 0.787878787879 |  5   |
+-------------+-----------------+----------------+------+
[5 rows x 4 columns]



We see that the **distances** are fairly large (all above 0.75), but let's see what the closest movies are:

In [110]:
ill_manors = imdb_frame[imdb_frame['name'] == "Ill Manors"]

In [113]:
print(ill_manors['description'])

['The lives of four drug dealers, one user and two prostitutes. (121 mins.)', ... ]


In [121]:
another_year = imdb_frame[imdb_frame['name'] == "Another Year"]

In [122]:
print(another_year['description'])

['A look at four seasons in the lives of a happily married couple and their relationships with their family and friends. (129 mins.)', ... ]


In [123]:
boiling_point = imdb_frame[imdb_frame['name'] == "Boiling Point"]

In [124]:
print(boiling_point['description'])

['A pair of sociopath killers take on the police and the mob in order to make one last big score. (92 mins.)', ... ]


In [125]:
new_years_eve = imdb_frame[imdb_frame['name'] == "New Year's Eve"]

In [126]:
print(new_years_eve['description'])

["The lives of several couples and singles in New York intertwine over the course of New Year's Eve. (113 mins.)", ... ]


"Ill Manors" and "Boiling Point" seem like reasonable matches for Pulp Fiction, but "Another Year" and "New Year's Eve" seem a little off. Let's look at the tf-idf weights in those movies:

In [117]:
print(ill_manors[['tf_idf']].stack('tf_idf', new_column_name=['tf_idf', 'count']).sort('count', ascending=False))

+--------------+----------------+
|    tf_idf    |     count      |
+--------------+----------------+
|   dealers,   | 8.48673398393  |
| prostitutes. | 8.48673398393  |
|     user     | 8.48673398393  |
|     (121     |  4.737229908   |
|     drug     | 4.41770722969  |
|     four     | 4.18266889073  |
|    lives     | 3.84716237123  |
|     one      | 2.93765789904  |
|     two      | 2.64900353677  |
|     and      | 0.758758441826 |
+--------------+----------------+
[13 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [127]:
print(another_year[['tf_idf']].stack('tf_idf', new_column_name=['tf_idf', 'count']).sort('count', ascending=False))

+---------------+---------------+
|     tf_idf    |     count     |
+---------------+---------------+
|    seasons    | 8.48673398393 |
|    happily    | 6.98265658716 |
| relationships | 5.96100533962 |
|    friends.   | 5.77868378283 |
|      (129     | 5.30868015358 |
|      look     | 4.81043331202 |
|    married    | 4.65809258744 |
|     couple    | 4.33569407803 |
|     their     | 4.28570709961 |
|      four     | 4.18266889073 |
+---------------+---------------+
[20 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [128]:
print(boiling_point[['tf_idf']].stack('tf_idf', new_column_name=['tf_idf', 'count']).sort('count', ascending=False))

+-----------+---------------+
|   tf_idf  |     count     |
+-----------+---------------+
|   score.  | 8.08126887582 |
| sociopath | 8.08126887582 |
|  killers  | 6.40729244225 |
|    mob    | 5.37321867472 |
|    pair   | 5.17254797926 |
|    big    | 4.84914782421 |
|    last   | 4.40075767138 |
|    make   | 4.18266889073 |
|   police  | 4.08613096368 |
|   order   | 3.98138413323 |
+-----------+---------------+
[21 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [129]:
print(new_years_eve[['tf_idf']].stack('tf_idf', new_column_name=['tf_idf', 'count']).sort('count', ascending=False))

+------------+---------------+
|   tf_idf   |     count     |
+------------+---------------+
| intertwine | 7.79358680337 |
|  singles   | 7.79358680337 |
|   year's   | 6.78198589169 |
|    eve.    | 6.47183096339 |
|  couples   | 6.00182733414 |
|   course   | 5.81258533451 |
|  several   | 5.32973356278 |
|    new     |  5.2348741416 |
|    (113    | 4.29707924191 |
|    york    | 3.88657633977 |
+------------+---------------+
[17 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
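
The distances returned by the query for Pulp Fiction can also be checked by hand. The sketch below computes a plain Jaccard distance (one minus the number of shared words over the number of distinct words in both descriptions) between word sets. The Pulp Fiction word set is copied from its tf-idf table above, and the other descriptions are tokenized by lowercasing and splitting on whitespace, which is an assumption based on the printed tokens. The helper names `tokens` and `jaccard_distance` are hypothetical.

```python
def tokens(description):
    """Lowercase and split on whitespace, keeping punctuation attached."""
    return set(description.lower().split())


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of words."""
    return 1.0 - float(len(a & b)) / len(a | b)


# The 23 tokens listed in the tf-idf table for Pulp Fiction.
pulp_fiction_words = {
    "gangster's", "bandits", "redemption.", "boxer,", "diner", "intertwine",
    "(154", "men,", "tales", "violence", "wife,", "mob", "pair", "hit",
    "four", "lives", "two", "of", "and", "in", "a", "the", "mins.)",
}

# Descriptions as printed above.
neighbors = {
    "Ill Manors": "The lives of four drug dealers, one user and two "
                  "prostitutes. (121 mins.)",
    "Another Year": "A look at four seasons in the lives of a happily "
                    "married couple and their relationships with their "
                    "family and friends. (129 mins.)",
    "Boiling Point": "A pair of sociopath killers take on the police and "
                     "the mob in order to make one last big score. (92 mins.)",
    "New Year's Eve": "The lives of several couples and singles in New York "
                      "intertwine over the course of New Year's Eve. "
                      "(113 mins.)",
}

distances = {name: jaccard_distance(pulp_fiction_words, tokens(text))
             for name, text in neighbors.items()}
for name, dist in sorted(distances.items(), key=lambda kv: kv[1]):
    print(name, dist)
```

These values agree with the four distances in the query output (22/29, 27/35, 28/36 and 26/33). That suggests the ranking here is driven by which words two descriptions share, including common ones such as 'the', 'of' and 'mins.)', rather than by the tf-idf weights themselves. This is an inference from the numbers, not something the model output states, but it would explain why "Another Year" and "New Year's Eve" rank so high.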


### More nearest neighbors

Let's look at some other suggestions:

In [137]:
future = imdb_frame[imdb_frame['name'] == "Back to the Future"]

In [141]:
print(tf_idf_model.query(future))

+-------------+------------------------+----------------+------+
| query_label |    reference_label     |    distance    | rank |
+-------------+------------------------+----------------+------+
|      0      |   Back to the Future   |      0.0       |  1   |
|      0      |  Liberty Stands Still  | 0.795454545455 |  2   |
|      0      | Bang Bang You're Dead  | 0.809523809524 |  3   |
|      0      | Cornbread, Earl and Me | 0.810810810811 |  4   |
|      0      |    Mad About Mambo     | 0.813953488372 |  5   |
+-------------+------------------------+----------------+------+
[5 rows x 4 columns]



In [139]:
harry_potter = imdb_frame[imdb_frame['name'] == "Harry Potter and the Sorcerer's Stone"]

In [142]:
print(tf_idf_model.query(harry_potter))

+-------------+-------------------------------+----------------+------+
| query_label |        reference_label        |    distance    | rank |
+-------------+-------------------------------+----------------+------+
|      0      | Harry Potter and the Sorce... |      0.0       |  1   |
|      0      |          The Good Son         |     0.725      |  2   |
|      0      |             North             |      0.75      |  3   |
|      0      |         Cloak & Dagger        | 0.763157894737 |  4   |
|      0      |      The Christmas Shoes      |     0.775      |  5   |
+-------------+-------------------------------+----------------+------+
[5 rows x 4 columns]



### Possible problems with the tf-idf model

We see that some of the suggestions could be better. One reason for the lack of accuracy is that the descriptions are too short. With so few descriptive words, it is difficult to characterize a movie:

In [143]:
imdb_frame[1099]['description']

'A behind-the-scenes look at the life-and-death struggles of modern-day gladiators and those who lead them. (162 mins.)'

In [144]:
imdb_frame[5690]['description']

'A woman finds romance when she takes a job at an aircraft plant to help make ends meet after her husband goes off to war. (100 mins.)'

So, evidently, if we obtained better summaries of the movies, the tf-idf model would have more informative words to weight, making the suggestions more accurate.
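
With so few words per description, every token matters, and the tables above show tokens such as 'redemption.', 'wife,' and '(154' that carry punctuation or come from the running time. The following sketch shows one possible way to normalize a description before counting words. It is not part of the original notebook: the function `clean_tokens` and both regular expressions are hypothetical choices, and the sample text is the description of entry 1099 shown above.

```python
import re

# Matches the running time at the end of a description, e.g. "(162 mins.)".
RUNTIME = re.compile(r"\(\d+\s+mins\.\)")
# A word: letters or digits, optionally joined by apostrophes or hyphens.
WORD = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")


def clean_tokens(description):
    """Lowercase, drop the running time and strip surrounding punctuation."""
    text = RUNTIME.sub(" ", description.lower())
    return WORD.findall(text)


raw = ("A behind-the-scenes look at the life-and-death struggles of "
       "modern-day gladiators and those who lead them. (162 mins.)")
cleaned = clean_tokens(raw)
print(cleaned)
```

Hyphenated words such as 'behind-the-scenes' stay in one piece, 'them.' becomes 'them', and the running time no longer contributes tokens that nearly every description shares.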