# Potential datasets:
- Bitcoin: https://www.kaggle.com/sudalairajkumar/cryptocurrencypricehistory#bitcoin_dataset.csv
- Car image classification: https://ai.stanford.edu/~jkrause/cars/car_dataset.html
- cancer risk classification: https://www.kaggle.com/loveall/cervical-cancer-risk-classification
- gender classification: https://www.kaggle.com/crowdflower/twitter-user-gender-classification
- house price prediction: https://www.kaggle.com/shree1992/housedata
- student grade prediction (linear regression): https://www.kaggle.com/dipam7/student-grade-prediction
---


# Final Project 
- Dataset: https://www.kaggle.com/PromptCloudHQ/imdb-data
- Notes: https://www.analyticsvidhya.com/blog/2019/04/predicting-movie-genres-nlp-multi-label-classification/
---

### Description of the Dataset

- The 1,000 most popular movies on IMDb over the 10 years from 2006 to 2016.
- Features:
    - Title, Genre, Description, Director, Actors, Year, Runtime, Rating, Votes, Revenue, Metascore


# Goal: Predicting the genre of the movie using the description

# Importing the packages

In [19]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Binary Relevance
from sklearn.multiclass import OneVsRestClassifier
# Performance metric
from sklearn.metrics import f1_score

from sklearn.preprocessing import MultiLabelBinarizer

In [7]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sukhrobjongolibboev/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Data Exploration

## GOAL: More EDA and Visualization!

In [21]:
# read the file
df = pd.read_csv("movies.csv")

In [4]:
df.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0


In [5]:
# look at the length of the first description to get a sense of how long a description can be
word_count = (df["Description"].values[0]).split(" ")
print("list of words from a description: {}\n Word count:{}".format(word_count, len(word_count)))


list of words from a description: ['A', 'group', 'of', 'intergalactic', 'criminals', 'are', 'forced', 'to', 'work', 'together', 'to', 'stop', 'a', 'fanatical', 'warrior', 'from', 'taking', 'control', 'of', 'the', 'universe.']
 Word count:21


In [9]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# function to remove stopwords
def remove_stopwords(text):
    no_stopword_text = [word for word in text.split() if word not in stop_words]
    return ' '.join(no_stopword_text)

df['Description'] = df['Description'].apply(lambda x: remove_stopwords(x.lower()))

# Steps:
1. Split input and target (the description is the input; the movie genres are the output)
2. One-hot encoding
3. Multi-label classification
    - A movie is rarely one-dimensional: a single title can span several genres, so each description can carry more than one label. This notebook therefore treats genre prediction as a multi-label classification problem and starts with a simple model.
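
The encoding step above can be sketched without any libraries. This is only an illustration of the idea behind multi-label binarization, not the notebook's actual pipeline; the rows in `genre_lists` and the helper `multi_hot` are hypothetical.

```python
# Toy genre lists in the same shape as the split "Genre" column (hypothetical rows).
genre_lists = [
    ["Action", "Adventure", "Sci-Fi"],
    ["Horror", "Thriller"],
    ["Action", "Thriller"],
]

# Sorted vocabulary of every genre that appears at least once.
classes = sorted({genre for genres in genre_lists for genre in genres})


def multi_hot(genres, classes):
    """Return a 0/1 vector with a 1 for every genre the movie has."""
    present = set(genres)
    return [1 if label in present else 0 for label in classes]


# One row per movie, one column per genre; a row may contain several 1s.
target = [multi_hot(genres, classes) for genres in genre_lists]

print(classes)
for row in target:
    print(row)
```

Unlike ordinary one-hot encoding, a row here can hold more than one 1, which is what makes the target multi-label.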

In [10]:
dataset = pd.DataFrame(df[["Description", "Genre"]])

### Overwriting Old Genre Data with New Genres Objects

In [11]:
genres = list()

for item in dataset["Genre"]:
    genres.append(item.split(","))

dataset["Genre"] = genres

In [12]:
dataset.head(3)

Unnamed: 0,Description,Genre
0,group intergalactic criminals forced work toge...,"[Action, Adventure, Sci-Fi]"
1,"following clues origin mankind, team finds str...","[Adventure, Mystery, Sci-Fi]"
2,three girls kidnapped man diagnosed 23 distinc...,"[Horror, Thriller]"


In [23]:
data_mlb = MultiLabelBinarizer()
data_mlb.fit(dataset['Genre'])

# transform target variable
target = data_mlb.transform(dataset['Genre'])

In [24]:
# training and validation set
X_train, X_test, y_train, y_test = train_test_split(dataset["Description"], 
                                                    target, 
                                                    train_size=0.75, 
                                                    test_size=0.25)

In [25]:
tfidf_proc = TfidfVectorizer(max_df=0.8, max_features=10000)

X_train_TFIDF = tfidf_proc.fit_transform(X_train)
X_test_TFIDF = tfidf_proc.transform(X_test)
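
To make the vectorization step less of a black box, here is a dependency-free sketch of TF-IDF weighting with a `max_df` cutoff. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which is what scikit-learn documents as its default; the whitespace tokenizer, the toy `docs`, and the function name `tfidf_sketch` are simplifications introduced for illustration, and `max_features` is not modeled.

```python
import math
from collections import Counter


def tfidf_sketch(docs, max_df=0.8):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]  # simplistic tokenizer
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Drop terms that occur in more than max_df (a fraction) of the documents.
    vocab = sorted(t for t, count in doc_freq.items() if count <= max_df * n_docs)

    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows


# Hypothetical mini-corpus: "young" occurs in every document and is filtered out.
docs = [
    "young crew fights aliens",
    "young detective hunts killer",
    "young crew hunts treasure",
]
vocab, rows = tfidf_sketch(docs)
print(vocab)
```

Within a row, terms that are rarer across the corpus (such as "aliens") receive a larger weight than terms shared by several documents (such as "crew").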

## GOAL: Test multiple different classification models!

In [26]:
logreg_model = LogisticRegression() # we're still doing classification! 
clf = OneVsRestClassifier(logreg_model)

In [27]:
# fit model on train data
clf.fit(X_train_TFIDF, y_train)



OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
          n_jobs=None)

In [28]:
# make predictions for validation set
y_pred = clf.predict(X_test_TFIDF)

In [29]:
y_pred[125]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [30]:
data_mlb.inverse_transform(y_pred)[10]

()

In [31]:
# evaluate performance
f1_score(y_test, y_pred, average="micro")

0.24203821656050956
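
For reference, micro-averaged F1 pools true positives, false positives, and false negatives over every (movie, genre) cell before computing a single score. The hand-rolled sketch below uses hypothetical toy labels and a hypothetical helper `micro_f1`; it is meant to show why empty predictions, like the all-zero row printed earlier, pull the score down: they contribute only false negatives.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (sample, label) cells of two 0/1 matrices."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    # With no positives anywhere the score is undefined; report 0.0 here.
    return 2 * tp / denom if denom else 0.0


# Toy example: the second movie gets no predicted genre at all.
y_true_toy = [[1, 1, 0], [0, 1, 1], [1, 0, 0]]
y_pred_toy = [[1, 0, 0], [0, 0, 0], [1, 0, 0]]

# tp = 2, fp = 0, fn = 3, so F1 = 2*2 / (2*2 + 0 + 3) = 4/7
print(micro_f1(y_true_toy, y_pred_toy))
```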

In [32]:
# predict probabilities
y_pred_prob = clf.predict_proba(X_test_TFIDF)
y_pred_prob

array([[0.26715106, 0.36034356, 0.0592857 , ..., 0.1861672 , 0.01973362,
        0.01365276],
       [0.2218248 , 0.22542609, 0.05807306, ..., 0.18844254, 0.01952819,
        0.01234747],
       [0.4934956 , 0.33247897, 0.04794172, ..., 0.19023084, 0.01700441,
        0.01229569],
       ...,
       [0.28662091, 0.18260336, 0.0447138 , ..., 0.21793851, 0.01926057,
        0.01215912],
       [0.26857084, 0.18485276, 0.04946756, ..., 0.21130098, 0.01909779,
        0.01271356],
       [0.30843659, 0.27330017, 0.05148889, ..., 0.14802742, 0.01916243,
        0.01241234]])

In [33]:
t = 0.27 # threshold value
y_pred_corrected = (y_pred_prob >= t).astype(int)

## GOAL: Use `GridSearchCV` or Custom Hyperparameter Tuning to Boost Performance!

In [34]:
# evaluate performance
f1_score(y_test, y_pred_corrected, average="micro")

0.4761904761904762
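
One simple form of custom tuning is to sweep the threshold instead of fixing it by hand. The sketch below is an assumption-laden illustration rather than the notebook's method: the helpers `micro_f1`, `apply_threshold`, and `best_threshold` and the toy arrays are all hypothetical. In practice the sweep should run on a separate validation split, since choosing the threshold on the test set makes the reported score optimistic.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (sample, label) cells."""
    pairs = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def apply_threshold(probs, t):
    """Turn per-label probabilities into 0/1 predictions."""
    return [[1 if p >= t else 0 for p in row] for row in probs]


def best_threshold(y_true, probs, candidates):
    """Return (score, threshold) for the candidate with the highest micro-F1."""
    scored = [(micro_f1(y_true, apply_threshold(probs, t)), t) for t in candidates]
    return max(scored)  # ties are broken in favor of the larger threshold


# Hypothetical validation labels and predicted probabilities.
y_val = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
probs_val = [[0.45, 0.10, 0.30], [0.20, 0.35, 0.05], [0.40, 0.28, 0.15]]

score, threshold = best_threshold(y_val, probs_val, [0.1, 0.2, 0.3, 0.4, 0.5])
print(threshold, score)
```

On this toy data a cutoff of 0.5 predicts no labels at all and scores 0.0, while lower cutoffs trade a few false positives for many recovered labels.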

## Super cool code! 

Another way to perform **multilabel binarization** without scikit-learn's `MultiLabelBinarizer`, using only pandas!

In [35]:
# "Genre" already holds lists at this point, so expand it to one row per genre
# (Series.explode requires pandas >= 0.25)
label_binarization = dataset["Genre"].explode()
mlb_data = pd.get_dummies(label_binarization, prefix='is').groupby(level=0).sum()
dataset = dataset.join(mlb_data)
dataset.drop(columns=["Genre"], inplace=True)

### Supplementary Code to Get Unique Genres

In [None]:
unique_genres = set()

# df["Genre"] still holds the raw comma-separated strings
for item in df["Genre"]:
    # set.update adds each genre once; duplicates are ignored automatically
    unique_genres.update(item.split(","))

unique_genres

In [None]:
for category in unique_genres:
    dataset[category] = 0

In [None]:
dataset.head(3)

In [None]:
dummy = pd.DataFrame(data=["r", "g", "b", "b", "r", "g", "b"], columns=["color"])

In [None]:
dummy

In [None]:
dummy["is_red"], dummy["is_green"], dummy["is_blue"] = 0, 0, 0

In [None]:
for index, item in enumerate(dummy["color"]):
    # the column stores single-letter codes ("r", "g", "b")
    if item == "r":
        dummy.loc[index, "is_red"] = 1
    if item == "g":
        dummy.loc[index, "is_green"] = 1
    if item == "b":
        dummy.loc[index, "is_blue"] = 1

# "color" is kept for now: the MultiLabelBinarizer cell below still reads it

In [None]:
dummy

In [None]:

mlb_processor = MultiLabelBinarizer()
# each single-character string is treated as an iterable holding one label
mlb_processor.fit(dummy["color"])

mlb_data = mlb_processor.transform(dummy["color"])
mlb_data

In [None]:
dummy["is_blue"], dummy["is_green"], dummy["is_red"] = mlb_data[:, 0], mlb_data[:, 1], mlb_data[:, 2]

In [None]:
mlb_processor.classes_

## GOAL: Use WordClouds to Visualize Word Vectorization!

In [20]:
from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import WordCloud  # third-party package: pip install wordcloud

input_str = "one fish two fish red fish blue fish"
input_data = input_str.split(" ")

# count how often each word occurs
freqs = Counter(input_data)

wc = WordCloud()
wc.generate_from_frequencies(frequencies=freqs)

# render the word cloud with Matplotlib
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

ModuleNotFoundError: No module named 'wordcloud'