# Reference materials
0. [Vector Representations of Words](https://www.tensorflow.org/tutorials/word2vec)
1. [Word2Vec Tutorial Part 1 - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
2. [Word2Vec Tutorial Part 2 - Negative Sampling](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/)
3. [From word2vec to doc2vec: an approach driven by Chinese restaurant process](https://medium.com/kifi-engineering/from-word2vec-to-doc2vec-an-approach-driven-by-chinese-restaurant-process-93d3602eaa31)
4. [Doc2Vec tutorial using Gensim](https://medium.com/@klintcho/doc2vec-tutorial-using-gensim-ab3ac03d3a1)
5. [Sentiment Analysis Using Doc2Vec](http://linanqiu.github.io/2015/10/07/word2vec-sentiment/)
6. [Understanding Convolutional Neural Networks for NLP](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/)
7. [Implementing a CNN for Text Classification in TensorFlow](http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/)
8. [cnn-text-classification-github](https://github.com/cahya-wirawan/cnn-text-classification-tf)

# How to do document clustering with word2vec and doc2vec?

**Steps:**
1. Train a word2vec model to get word vectors
    - Got word2vec_starter.py working
    - Question: How to feed the embeddings to the doc2vec model?
        - Read in pre-trained embeddings
2. Get document vectors (doc2vec) from the word2vec vectors. Open question: how to do this in TensorFlow?
    - Chinese Restaurant Process method
    - Gensim doc2vec
    - CNN with multi-genre classification (see Viola's work)
3. Clustering
    - Feed document vectors as features to a classification model
    - Calculate cosine similarity between document vectors, then possibly cluster the documents with DBSCAN
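
As a rough illustration of steps 2 and 3, here is a minimal sketch that builds document vectors by averaging word vectors and then computes pairwise cosine similarity. Averaging is a simple baseline swapped in for illustration; it is not the Chinese Restaurant Process method, Gensim doc2vec, or the CNN approach listed above. The `word_vectors` dictionary, the toy documents, and both helper functions are hypothetical; real embeddings would come from the trained word2vec model.

```python
import numpy as np

# Hypothetical toy embeddings; real ones would come from the trained word2vec model.
word_vectors = {
    "space": np.array([1.0, 0.0, 0.0]),
    "alien": np.array([0.9, 0.1, 0.0]),
    "love": np.array([0.0, 1.0, 0.0]),
    "wedding": np.array([0.1, 0.9, 0.0]),
}


def document_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


def cosine_similarity_matrix(doc_vectors):
    """Pairwise cosine similarity between the rows of doc_vectors."""
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero documents
    unit = doc_vectors / norms
    return unit @ unit.T


# Hypothetical tokenized documents; out-of-vocabulary tokens are ignored.
docs = [
    ["space", "alien"],
    ["alien", "space", "unknownword"],
    ["love", "wedding"],
]
doc_vectors = np.vstack([document_vector(d, word_vectors, dim=3) for d in docs])
similarity = cosine_similarity_matrix(doc_vectors)
print(np.round(similarity, 3))
```

For the clustering step, `1 - similarity` (clipped at zero) gives a cosine distance matrix that could be passed to scikit-learn's `DBSCAN` with `metric="precomputed"`; `eps` and `min_samples` would need tuning on the real data.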

# Section 0: Import packages

In [30]:
import numpy as np
import pandas as pd
import codecs
from collections import Counter

# Section 1: Data preparation

In [3]:
imdb_with_storyline = pd.read_csv("../../01_Data/Outputs/imdb_with_storyline.csv")

In [31]:
# Flatten the '|'-separated genre strings and count how often each genre occurs
genres = Counter(g for sublist in imdb_with_storyline['genres'].str.split('|') for g in sublist)
genres = pd.DataFrame(genres.most_common()).set_index(0)
genres.index.name = None  # drop the '0' index label left over from set_index
genres.columns = ['Count']

In [33]:
# Genres with at least 50 titles keep their own column; the rest are folded into 'Others'
genres.loc[genres['Count'] >= 50].index.tolist()

['Drama',
 'Comedy',
 'Thriller',
 'Action',
 'Romance',
 'Adventure',
 'Crime',
 'Sci-Fi',
 'Fantasy',
 'Horror',
 'Family',
 'Mystery',
 'Biography',
 'Animation',
 'Music',
 'War',
 'History',
 'Sport',
 'Musical',
 'Documentary',
 'Western']

In [38]:
genres.loc[genres['Count']<50].index.tolist()

['Film-Noir', 'Short', 'News', 'Reality-TV', 'Game-Show']

In [45]:
imdb = imdb_with_storyline[['storyline', 'genres']].set_index(imdb_with_storyline['movie_title'])
# One 0/1 column per genre, split on the '|' separator
imdb = pd.concat([imdb['storyline'], imdb['genres'].str.get_dummies(sep='|')], axis=1)
# Flag titles that carry at least one rare genre (fewer than 50 titles)
imdb['Others'] = imdb[genres.loc[genres['Count'] < 50].index.tolist()].sum(axis=1).gt(0).astype(int)
# Keep the storyline, the frequent genres, and the 'Others' flag
imdb = imdb[['storyline'] + genres.loc[genres['Count'] >= 50].index.tolist() + ['Others']]

In [49]:
imdb['Others'].value_counts()

0    5027
1      16
Name: Others, dtype: int64
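
To make the genre encoding above easier to follow, here is a small sketch of the same idea on toy data: one-hot encode a `'|'`-separated column, then fold rare categories into a single `Others` flag. The toy DataFrame, the `min_count` threshold of 2, and the variable names are illustrative assumptions, not values from the IMDB data.

```python
import pandas as pd

# Hypothetical toy data in the same 'a|b|c' format as the genres column.
toy = pd.DataFrame(
    {"genres": ["Drama|Comedy", "Drama|Short", "Comedy", "News"]},
    index=["m1", "m2", "m3", "m4"],
)

dummies = toy["genres"].str.get_dummies(sep="|")  # one 0/1 column per genre
counts = dummies.sum()  # number of titles per genre

min_count = 2  # illustrative threshold; the notebook uses 50
frequent = counts[counts >= min_count].index.tolist()
rare = counts[counts < min_count].index.tolist()

encoded = dummies[frequent].copy()
# A title is flagged 'Others' if it carries at least one rare genre.
encoded["Others"] = dummies[rare].any(axis=1).astype(int)
print(encoded)
```

Using `>=` for the frequent set and `<` for the rare set means every genre lands in exactly one of the two groups, including one that sits exactly on the threshold.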

In [50]:
imdb.to_csv("../../01_Data/Outputs/storyline_with_genres.csv", index=True)

In [14]:
imdb = pd.read_csv("../../01_Data/Outputs/storyline_with_genres.csv", index_col=0)

In [16]:
imdb.head()

| movie_title | storyline | Action | Adventure | Animation | Biography | Comedy | Crime | Documentary | Drama | Family | ... | Mystery | News | Reality-TV | Romance | Sci-Fi | Short | Sport | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Avatar | When his brother is killed in a robbery, parap... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Pirates of the Caribbean: At World's End | After Elizabeth, Will, and Captain Barbossa re... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Spectre | A cryptic message from the past sends James Bo... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| The Dark Knight Rises | Despite his tarnished reputation after the eve... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Star Wars: Episode VII - The Force Awakens |  | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [18]:
# One storyline per line; header=False keeps the column name out of the text file
imdb['storyline'].to_csv("../../01_Data/Outputs/storyline.txt", sep="\n", index=False, header=False)
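
The text file written above is presumably the input for training word vectors (step 1). The preprocessing inside word2vec_starter.py is not shown here, so the following is only a sketch, under that assumption, of how one line of text could be turned into skip-gram (center, context) training pairs. The `tokenize` and `skipgram_pairs` helpers and the window size are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# The opening words of one storyline, used as a toy input line.
line = "A cryptic message from the past"
tokens = tokenize(line)
pairs = skipgram_pairs(tokens, window=1)
print(pairs[:4])
```

A larger window yields more pairs per word and tends to capture broader topical context, while a smaller window stays closer to local syntax; the right value for storylines would have to be found by experiment.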