Movie Rate Prediction with Tensorflow

Movie Rate Prediction

Movie rating prediction with TensorFlow

License: MIT

Environments

  • OS : Ubuntu 16.04+ / Windows 10
  • CPU : any (quad core ~)
  • GPU : GTX 1060 6GB ~
  • RAM : 16GB ~
  • Library : TF 1.x with CUDA 9.0~ + cuDNN 7.0~

Prerequisites

  • Python
  • MySQL DB
  • tensorflow 1.x
  • numpy
  • gensim, konlpy, and soynlp
  • mecab-ko
  • pymysql
  • h5py
  • tqdm
  • (Optional) java 1.7+
  • (Optional) PyKoSpacing
  • (Optional) MultiTSNE (for visualization)
  • (Optional) matplotlib (for visualization)

DataSet

DataSet             Language  Sentences  Words  Size
NAVER Movie Review  Korean    8.86M      391K   About 1GB

Movie Review Data Distribution

[Figure: movie review data distribution]

Usage

1.1 Installing Dependencies

# Necessary
$ sudo python3 -m pip install -r requirements.txt
# Optional
$ sudo python3 -m pip install -r opt_requirements.txt

1.2 Configuration

# config.py holds many parameters used by the scripts; adjust them for your environment before running anything.

2. Parsing the DataSet

$ python3 movie-parser.py

3. Making DataSet DB

$ python3 db.py

4. Making w2v/d2v embeddings (skip this step if you only want to use Char2Vec)

$ python3 preprocessing.py

usage: preprocessing.py [-h] [--load_from {db,csv}] [--vector {d2v,w2v}]
                        [--is_analyzed IS_ANALYZED]

Pre-Processing NAVER Movie Review Comment

optional arguments:
  -h, --help            show this help message and exit
  --load_from {db,csv}  load DataSet from db or csv
  --vector {d2v,w2v}    d2v or w2v
  --is_analyzed IS_ANALYZED
                        already analyzed data
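The help text above implies a small argparse interface. The following is a minimal sketch of a parser that would accept the same flags; it is not the code in preprocessing.py, and the default values and the `build_parser` name are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the help text shown above (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="preprocessing.py",
        description="Pre-Processing NAVER Movie Review Comment",
    )
    # The choices come from the usage line; the defaults are assumptions.
    parser.add_argument("--load_from", choices=["db", "csv"], default="db",
                        help="load DataSet from db or csv")
    parser.add_argument("--vector", choices=["d2v", "w2v"], default="w2v",
                        help="d2v or w2v")
    # Kept as a plain string here because the help text does not state its type.
    parser.add_argument("--is_analyzed", default="False",
                        help="already analyzed data")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--load_from", "csv", "--vector", "d2v"])
    print(args.load_from, args.vector, args.is_analyzed)
```

With a parser like this, an unsupported value such as `--vector glove` is rejected by argparse before any processing starts.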

5. Training a Model

$ python3 main.py --refine_data [True or False]

usage: main.py [-h] [--checkpoint CHECKPOINT] [--refine_data REFINE_DATA]

train/test movie review classification model

optional arguments:
  -h, --help            show this help message and exit
  --checkpoint CHECKPOINT
                        pre-trained model
  --refine_data REFINE_DATA
                        solving data imbalance problem
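Because `--refine_data` is passed as the literal text True or False, the flag needs explicit string-to-boolean parsing: Python's `bool("False")` evaluates to `True`. The sketch below shows one common way to handle that with argparse. The `str2bool` helper, the defaults, and `build_parser` are hypothetical and may differ from what main.py actually does.

```python
import argparse


def str2bool(value: str) -> bool:
    """Convert 'True'/'False' style strings to bool (hypothetical helper)."""
    lowered = value.strip().lower()
    if lowered in ("true", "t", "yes", "y", "1"):
        return True
    if lowered in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the main.py help text (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="train/test movie review classification model",
    )
    # Defaults below are assumptions, not taken from the repository.
    parser.add_argument("--checkpoint", default=None, help="pre-trained model")
    parser.add_argument("--refine_data", type=str2bool, default=False,
                        help="solving data imbalance problem")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--refine_data", "False"])
    print(args.checkpoint, args.refine_data)
```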

Repo Tree

│
├── comments          (NAVER Movie Review DataSets)
│    ├── 10000.sql
│    ├── ...
│    └── 200000.sql
├── w2v               (Word2Vec)
│    ├── ko_w2v.model (Word2Vec trained gensim model)
│    └── ...
├── d2v               (Doc2Vec)
│    ├── ko_d2v.model (Doc2Vec trained gensim model)
│    └── ...
├── model             (Movie Review Rate ML Models)
│    ├── textcnn.py
│    └── textrnn.py
├── image             (explanation images)
│    └── *.png
├── ml_model          (TF pre-trained models are saved here)
│    ├── checkpoint
│    ├── ...
│    └── charcnn-best_loss.ckpt
├── config.py         (Configuration)
├── tfutil.py         (handy tfutils)
├── dataloader.py     (Doc/Word2Vec model loader)
├── movie-parser.py   (NAVER Movie Review Parser)
├── db.py             (DataBase processing)
├── preprocessing.py  (Korean normalize/tokenize)
├── visualize.py      (for visualizing w2v)
└── main.py           (for easy use of train/test)

Pre-Trained Models

The pre-trained models below can be downloaded from Google Drive.

  • Embedding Models

    • Word2Vec model : here
  • M.L Models

    • TextCNN model : here
    • TextRNN model : here

Models

  • TextCNN

[Figure: TextCNN model]

Credit: Toxic Comment Classification Kaggle 1st-place solution

  • TextRNN

[Figure: TextRNN model]

Credit: Toxic Comment Classification Kaggle 1st-place solution

Results

The dataset quality is poor, so the results are not as good as I expected :(
The raw sentences need to be refined and normalized!
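As an illustration of what refining raw sentences could look like, here is a minimal sketch that collapses long character runs and extra whitespace, both common in user reviews. The `normalize_review` function and its rules are assumptions for illustration only, not the repository's preprocessing.

```python
import re

# Any single character repeated three or more times in a row.
_REPEAT = re.compile(r"(.)\1{2,}")
# Any run of whitespace.
_SPACES = re.compile(r"\s+")


def normalize_review(text: str) -> str:
    """Shorten runs of 3+ identical characters to 2 and collapse whitespace."""
    collapsed = _REPEAT.sub(lambda match: match.group(1) * 2, text)
    return _SPACES.sub(" ", collapsed).strip()


if __name__ == "__main__":
    print(normalize_review("ㅋㅋㅋㅋㅋ  재밌다!!!!"))
```

Capping repeats at two rather than one keeps some of the emphasis signal while shrinking the number of distinct tokens; whether that helps the rating prediction would have to be checked experimentally.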

  • TextCNN (Char2Vec)

Result : train MSE 1.553, val MSE 3.341
Hyper-parameters : rand, conv kernel sizes [10,9,7,5,3], conv filters 256, dropout 0.7, fc units 1024, adam, embed size 384
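The results in this section are reported as mean squared error (MSE). Taking the square root gives the RMSE, which is in the same units as the rating itself: a validation MSE of 3.341 corresponds to an RMSE of roughly 1.83, and a train MSE of 1.553 to roughly 1.25. The helpers below are a small sketch of both calculations; the function names are illustrative and not part of the repository.

```python
import math
from typing import Sequence


def mse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Mean of squared differences between predictions and targets."""
    if not predictions or len(predictions) != len(targets):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared) / len(squared)


def rmse_from_mse(value: float) -> float:
    """Convert an MSE into an error expressed in rating units."""
    return math.sqrt(value)


if __name__ == "__main__":
    print(round(rmse_from_mse(3.341), 3))  # validation MSE reported above
    print(round(rmse_from_mse(1.553), 3))  # train MSE reported above
```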

  • TextCNN (Word2Vec)

Result : train MSE 3.410
Hyper-parameters : non-static, conv kernel sizes [2,3,4,5], conv filters 256, dropout 0.7, fc units 1024, adadelta, embed size 300

  • TextRNN (Word2Vec)

Result : train MSE 3.646
Hyper-parameters : non-static, rnn cells 128, attention 128, dropout 0.7, fc units 1024, adadelta, embed size 300

  • TextRNN (Char2Vec)

Coming soon!

Visualization

To launch TensorBoard, run tensorboard --logdir=./ml_model/

Word2Vec Embeddings (t-SNE)

Perplexity : 80
Learning rate : 10
Iteration : 310

To-Do

  1. Deal with the word-spacing problem

ETC

Suggestions, PRs, and issues are all WELCOME :)

Author

HyeongChan Kim / @kozistr