# Movie Rating Modelling

Tubagus Langlang Purwasasmita

221810634

3SD1 - Politeknik Statistika STIS

## Importing Libraries

In [32]:
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier

from sklearn.preprocessing import OneHotEncoder, LabelEncoder

from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix, classification_report

## Loading Data

In [4]:
data = pd.read_excel('data/2014 and 2015 CSM dataset.xlsx')
data.head(6)

Unnamed: 0,Movie,Year,Ratings,Genre,Gross,Budget,Screens,Sequel,Sentiment,Views,Likes,Dislikes,Comments,Aggregate Followers
0,13 Sins,2014,6.3,8,9130,4000000.0,45.0,1,0,3280543,4632,425,636,1120000.0
1,22 Jump Street,2014,7.1,1,192000000,50000000.0,3306.0,2,2,583289,3465,61,186,12350000.0
2,3 Days to Kill,2014,6.2,1,30700000,28000000.0,2872.0,1,0,304861,328,34,47,483000.0
3,300: Rise of an Empire,2014,6.3,1,106000000,110000000.0,3470.0,2,0,452917,2429,132,590,568000.0
4,A Haunted House 2,2014,4.7,8,17300000,3500000.0,2310.0,2,0,3145573,12163,610,1082,1923800.0
5,A Long Way Off,2014,4.6,3,29000,500000.0,,1,0,91137,112,7,1,310000.0


## Data Exploration

### Describing Data

In [5]:
data.describe(include = 'all')

Unnamed: 0,Movie,Year,Ratings,Genre,Gross,Budget,Screens,Sequel,Sentiment,Views,Likes,Dislikes,Comments,Aggregate Followers
count,231,231.0,231.0,231.0,231.0,230.0,221.0,231.0,231.0,231.0,231.0,231.0,231.0,196.0
unique,231,,,,,,,,,,,,,
top,Inherent Vice,,,,,,,,,,,,,
freq,1,,,,,,,,,,,,,
mean,,2014.294372,6.441558,5.359307,68066030.0,47921730.0,2209.244344,1.359307,2.809524,3712851.0,12732.536797,679.051948,1825.701299,3038193.0
std,,0.45675,0.988765,4.141611,88902890.0,54288250.0,1463.767755,0.967241,6.996775,4511104.0,28825.484481,1243.929481,3571.040447,4886278.0
min,,2014.0,3.1,1.0,2470.0,70000.0,2.0,1.0,-38.0,698.0,1.0,0.0,0.0,1066.0
25%,,2014.0,5.8,1.0,10300000.0,9000000.0,449.0,1.0,0.0,623302.0,1776.5,105.5,248.5,183025.0
50%,,2014.0,6.5,3.0,37400000.0,28000000.0,2777.0,1.0,0.0,2409338.0,6096.0,341.0,837.0,1052600.0
75%,,2015.0,7.1,8.0,89350000.0,65000000.0,3372.0,1.0,5.5,5217380.0,15247.5,697.5,2137.0,3694500.0


### Checking Missing Values

In [8]:
data.isna().any(axis = 0)

Movie                  False
Year                   False
Ratings                False
Genre                  False
Gross                  False
Budget                  True
Screens                 True
Sequel                 False
Sentiment              False
Views                  False
Likes                  False
Dislikes               False
Comments               False
Aggregate Followers     True
dtype: bool

The ```Budget```, ```Screens```, and ```Aggregate Followers``` columns contain missing values. The number of missing values in each is:

In [5]:
data[['Budget', 'Screens', 'Aggregate Followers']].isna().sum(axis = 0)

Budget                  1
Screens                10
Aggregate Followers    35
dtype: int64

For now, I will simply drop every row that contains a missing value.

## Data Preprocessing

### Dropping the Movie and Year Columns and Rows with Missing Values

The movie title (```Movie```) and release year (```Year```) have little influence on the rating, so both columns are dropped.

In [8]:
# Drop the Movie and Year columns
new_data = data.drop(['Movie', 'Year'], axis = 1)

# Drop rows that contain missing values
new_data = new_data.dropna(axis = 0)
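
To see how much data this step costs, the row counts before and after `dropna` can be compared. The sketch below uses a small hypothetical frame (`toy`) rather than the CSM dataset; the column names are borrowed from the notebook, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are invented for illustration only.
toy = pd.DataFrame({
    'Budget': [4.0e6, np.nan, 2.8e7, 1.1e8],
    'Screens': [45.0, 3306.0, np.nan, 3470.0],
    'Aggregate Followers': [1.12e6, 1.235e7, 4.83e5, np.nan],
})

# dropna(axis=0) removes every row that has at least one missing value.
toy_clean = toy.dropna(axis=0)

rows_before, rows_after = len(toy), len(toy_clean)
print(f'{rows_before - rows_after} of {rows_before} rows dropped')
```

In the notebook itself, 231 rows go in and 187 come out (149 training rows plus 38 test rows further down), so roughly a fifth of the data is lost to this step.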

### Label Categorization

Because this is framed as a classification task, I will bin the rating label into three categories:

* ```rendah``` (low): quantile 0% to 33%
* ```sedang``` (medium): quantile 33% to 66%
* ```tinggi``` (high): quantile 66% to 100%

In [11]:
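# Cut points at the 33% and 66% quantiles of the ratings
# (computed on the full data, before rows with missing values are dropped)
bounder = np.quantile(data.Ratings, [0.33,0.66])

class_ = ['rendah', 'sedang', 'tinggi']

# Counting how many cut points a rating exceeds gives its index in class_;
# the assignment aligns on the row index, so dropped rows are ignored
new_data['rating_class'] = data.Ratings.apply(
    lambda x : class_[sum(x > bounder)]
)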
new_data.head(5)

Unnamed: 0,Ratings,Genre,Gross,Budget,Screens,Sequel,Sentiment,Views,Likes,Dislikes,Comments,Aggregate Followers,rating_class
0,6.3,8,9130,4000000.0,45.0,1,0,3280543,4632,425,636,1120000.0,sedang
1,7.1,1,192000000,50000000.0,3306.0,2,2,583289,3465,61,186,12350000.0,tinggi
2,6.2,1,30700000,28000000.0,2872.0,1,0,304861,328,34,47,483000.0,sedang
3,6.3,1,106000000,110000000.0,3470.0,2,0,452917,2429,132,590,568000.0,sedang
4,4.7,8,17300000,3500000.0,2310.0,2,0,3145573,12163,610,1082,1923800.0,rendah
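
The expression `sum(x > bounder)` counts how many cut points a rating exceeds, and that count is the index into `class_`. A minimal sketch of the mapping follows, with hypothetical cut points (`bounds = [6.0, 7.0]` is an assumption for illustration, not the dataset's actual quantiles), next to the equivalent `np.digitize` call:

```python
import numpy as np

class_ = ['rendah', 'sedang', 'tinggi']

# Hypothetical cut points; the notebook derives them from np.quantile.
bounds = np.array([6.0, 7.0])
ratings = np.array([4.7, 6.0, 6.3, 7.1])

# Number of cut points each rating strictly exceeds: 0, 1 or 2.
idx_manual = np.array([int(np.sum(r > bounds)) for r in ratings])

# np.digitize with right=True applies the same rule: bounds[i-1] < x <= bounds[i].
idx_digitize = np.digitize(ratings, bounds, right=True)

labels = [class_[i] for i in idx_manual]
print(labels)  # ['rendah', 'rendah', 'sedang', 'tinggi']
```

A rating exactly equal to a cut point falls into the lower class, because the comparison is strict.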


### Label and Feature Encoding

The categorical features need to be encoded with ```OneHotEncoder```, and the class label (```rating_class```) with ```LabelEncoder```.

In [20]:
X = new_data.drop('Ratings', axis = 1)
y = X.pop('rating_class')

# One-hot encode the categorical columns
cat_col = ['Genre', 'Sequel']

# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`
ohe = OneHotEncoder(sparse = False, categories = 'auto')

# Combine the numeric columns with the one-hot encoded categorical columns
X = np.concatenate(
    (
        # Numeric data
        X.drop(cat_col, axis = 1).to_numpy(),
        # Categorical data
        ohe.fit_transform(X[cat_col])
    ),
    axis = 1
)

# Encode the rating class
le = LabelEncoder()
y = le.fit_transform(y)
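
The manual `np.concatenate` step can also be expressed with scikit-learn's `ColumnTransformer`, which keeps the encoder and the column selection in one object. This is a sketch on a hypothetical three-row frame (`toy`), not the author's pipeline. Note that the column order differs from the notebook: the encoded columns come first and the passed-through numeric columns last.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame; values are invented for illustration only.
toy = pd.DataFrame({
    'Gross': [9130, 192000000, 30700000],
    'Genre': [8, 1, 1],
    'Sequel': [1, 2, 1],
})
cat_col = ['Genre', 'Sequel']

# One-hot encode the categorical columns and pass the rest through unchanged.
# handle_unknown='ignore' encodes categories unseen during fit as all zeros.
# sparse_threshold=0 forces a dense array regardless of the encoder's default.
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), cat_col)],
    remainder='passthrough',
    sparse_threshold=0,
)
X_toy = ct.fit_transform(toy)
print(X_toy.shape)  # (3, 5)
```

The fitted `ct` can later be applied to new rows with `ct.transform`, which reuses the categories learned during `fit`.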

In [21]:
X[:4,:]

array([[9.130000e+03, 4.000000e+06, 4.500000e+01, 0.000000e+00,
        3.280543e+06, 4.632000e+03, 4.250000e+02, 6.360000e+02,
        1.120000e+06, 0.000000e+00, 0.000000e+00, 0.000000e+00,
        0.000000e+00, 0.000000e+00, 1.000000e+00, 0.000000e+00,
        0.000000e+00, 0.000000e+00, 0.000000e+00, 1.000000e+00,
        0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00,
        0.000000e+00, 0.000000e+00],
       [1.920000e+08, 5.000000e+07, 3.306000e+03, 2.000000e+00,
        5.832890e+05, 3.465000e+03, 6.100000e+01, 1.860000e+02,
        1.235000e+07, 1.000000e+00, 0.000000e+00, 0.000000e+00,
        0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00,
        0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00,
        1.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00,
        0.000000e+00, 0.000000e+00],
       [3.070000e+07, 2.800000e+07, 2.872000e+03, 0.000000e+00,
        3.048610e+05, 3.280000e+02, 3.400000e+01, 4.700000e+01,
        4.830000e+05, 1.000000

In [24]:
y

array([1, 2, 1, 1, 0, 0, 2, 1, 0, 2, 0, 0, 0, 0, 1, 1, 2, 2, 1, 1, 2, 0,
       1, 2, 2, 1, 0, 2, 1, 2, 0, 0, 2, 1, 2, 2, 1, 0, 1, 2, 2, 0, 0, 1,
       2, 2, 0, 2, 0, 0, 1, 2, 2, 2, 0, 1, 0, 2, 1, 2, 0, 2, 1, 1, 1, 1,
       0, 0, 2, 0, 2, 0, 2, 0, 0, 0, 1, 1, 0, 2, 0, 0, 1, 0, 0, 2, 0, 0,
       0, 0, 2, 1, 2, 2, 2, 0, 2, 0, 1, 2, 2, 1, 2, 2, 1, 2, 0, 2, 1, 2,
       0, 1, 2, 0, 1, 0, 1, 2, 2, 0, 0, 1, 0, 0, 2, 2, 1, 2, 2, 1, 1, 2,
       1, 2, 2, 2, 2, 1, 1, 2, 1, 0, 0, 1, 2, 2, 1, 2, 1, 1, 0, 2, 1, 0,
       0, 1, 1, 1, 0, 0, 0, 2, 2, 2, 2, 0, 0, 1, 2, 2, 2, 0, 0, 0, 1, 0,
       2, 0, 1, 0, 0, 0, 2, 0, 0, 2, 1])
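
`LabelEncoder` assigns integers in sorted order of the class names, so the codes in `y` can be read back through `classes_`. A small sketch with the three labels used above:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['sedang', 'tinggi', 'sedang', 'rendah'])

# classes_ holds the labels in sorted order; the position is the integer code.
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)  # {'rendah': 0, 'sedang': 1, 'tinggi': 2}

# inverse_transform converts codes back to the original labels.
decoded = le.inverse_transform(codes).tolist()
```

In the array above, 0 therefore stands for ```rendah``` (low), 1 for ```sedang``` (medium), and 2 for ```tinggi``` (high).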

### Splitting Data

Test set: 20% of the data

In [25]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 42)

X_train.shape

(149, 26)
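
With fewer than 200 rows, a purely random split can leave the class proportions in the test set noticeably different from those in the training set. One option, not used in the notebook, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic labels (`y_demo`) to show the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 90 samples, three classes of 30 each.
X_demo = np.arange(90).reshape(-1, 1)
y_demo = np.repeat([0, 1, 2], 30)

# stratify=y_demo keeps the class proportions equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(np.bincount(y_te))  # [6 6 6]
```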

## Modelling

The model used for now (for the midterm exam, UTS) is ```RandomForestClassifier```.

In [27]:
model = RandomForestClassifier(criterion = 'entropy', random_state=42)

model.fit(X_train, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

### Evaluating

In [28]:
y_pred = model.predict(X_test)


print('Confusion Matrix\n',confusion_matrix(y_test, y_pred), '\n')
print('Accuracy =', accuracy_score(y_test, y_pred))
print('Precision =', precision_score(y_test, y_pred, average='weighted'))
print('Recall =', recall_score(y_test, y_pred, average='weighted'))
print('F1 Score =', f1_score(y_test, y_pred, average='weighted'))

Confusion Matrix
 [[9 3 3]
 [6 3 0]
 [5 0 9]] 

Accuracy = 0.5526315789473685
Precision = 0.5723684210526315
Recall = 0.5526315789473685
F1 Score = 0.552805089647195
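
The weighted scores can be reproduced by hand from the confusion matrix, which shows what `average='weighted'` does: each class's score is weighted by its number of true samples (its support). The sketch relies on scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix reported above (rows: true class, columns: predicted class).
cm = np.array([[9, 3, 3],
               [6, 3, 0],
               [5, 0, 9]])

tp = np.diag(cm)             # correctly classified samples per class
support = cm.sum(axis=1)     # true samples per class
predicted = cm.sum(axis=0)   # predicted samples per class

precision = tp / predicted
recall = tp / support
f1 = 2 * precision * recall / (precision + recall)

# Support-weighted averages, as computed by average='weighted'.
weights = support / support.sum()
accuracy = tp.sum() / cm.sum()
w_precision = np.sum(weights * precision)
w_recall = np.sum(weights * recall)
w_f1 = np.sum(weights * f1)

print('Accuracy  =', accuracy)
print('Precision =', w_precision)
print('Recall    =', w_recall)
print('F1 score  =', w_f1)
```

Per class, recall is 0.60, 0.33 and 0.64: class 1 (```sedang```) is recovered only a third of the time, and six of its nine test samples are predicted as class 0 (```rendah```).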


In [30]:
cross_validate(model, X_train, y_train, scoring = 'accuracy', cv = 5)

{'fit_time': array([0.12480021, 0.12479997, 0.1092    , 0.22361183, 0.15520287]),
 'score_time': array([0.0156002 , 0.0156002 , 0.0156002 , 0.02000117, 0.01559973]),
 'test_score': array([0.48387097, 0.4       , 0.3       , 0.43333333, 0.5       ])}
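
The five fold scores are easier to compare with the test accuracy once they are summarized. A minimal sketch using the `test_score` values printed above:

```python
import numpy as np

# Fold accuracies copied from the cross_validate output above.
test_score = np.array([0.48387097, 0.4, 0.3, 0.43333333, 0.5])

mean_acc = test_score.mean()
std_acc = test_score.std()

print(f'CV accuracy: {mean_acc:.3f} +/- {std_acc:.3f}')
print(f'range: {test_score.min():.2f} to {test_score.max():.2f}')
```

The mean of about 0.42 sits below the 0.55 measured on the test set, which suggests that the single test figure may owe something to a favorable split.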

## Conclusion

With the Random Forest model, movie ratings can be predicted with an accuracy of about 0.55 on the held-out test set, while 5-fold cross-validation on the training set gives accuracies ranging from 0.3 to 0.5.

This accuracy is still not good enough. Several things could be done going forward:

* Adding more data
* Trying other models, such as SVM, multinomial logistic regression, or a neural network
* Tuning the hyperparameters of each model to find its best parameter settings
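
As a rough sketch of the last two points, candidate models can be compared with `cross_val_score`, and a random forest can be tuned with `GridSearchCV`. Everything here is an assumption for illustration: the data comes from `make_classification` rather than the CSM dataset, and the parameter grid is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class stand-in for the movie data.
X_demo, y_demo = make_classification(
    n_samples=150, n_features=10, n_informative=5,
    n_classes=3, random_state=42,
)

# Scale-sensitive models are wrapped in a pipeline with StandardScaler.
candidates = {
    'random_forest': RandomForestClassifier(n_estimators=50, random_state=42),
    'svm': make_pipeline(StandardScaler(), SVC()),
    'logistic': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# Mean 5-fold accuracy per candidate model.
scores = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring='accuracy').mean()
    for name, est in candidates.items()
}
print(scores)

# Exhaustive search over a small, arbitrary grid for the random forest.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={'n_estimators': [50, 100], 'max_depth': [None, 5]},
    cv=5, scoring='accuracy',
)
grid.fit(X_demo, y_demo)
print(grid.best_params_, grid.best_score_)
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into the preprocessing.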