# Movie Genre Classification
A machine learning model that predicts the genre of a movie from its plot summary or other textual information. <br>
Since the plot summary is text, we use Natural Language Processing (NLP) techniques to turn it into numeric features.

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

from nltk.corpus import stopwords

## Importing the dataset

In [2]:
# Column headers
col_names = pd.read_csv("/kaggle/input/genre-classification-dataset-imdb/Genre Classification Dataset/description.txt", sep=":::", engine='python').reset_index().iloc[0].str.strip()
col_names = np.array(col_names)
col_names

array(['ID', 'TITLE', 'GENRE', 'DESCRIPTION'], dtype=object)
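The fields in these files are separated by `:::` with spaces on either side. As a minimal sketch (the two rows below are made up, and the regex separator is an assumption rather than what this notebook uses), the padding can be stripped at parse time:

```python
import io
import pandas as pd

# Two made-up rows in the same "ID ::: TITLE ::: GENRE ::: DESCRIPTION" layout
sample = io.StringIO(
    "1 ::: Example Movie (2001) ::: drama ::: A short made-up plot.\n"
    "2 ::: Another Title (1999) ::: comedy ::: Another made-up plot.\n"
)

# A regex separator that also swallows the spaces around ":::"
sample_df = pd.read_csv(
    sample,
    sep=r"\s*:::\s*",
    engine="python",
    header=None,
    names=["ID", "TITLE", "GENRE", "DESCRIPTION"],
)
print(sample_df["GENRE"].tolist())
```

With this separator the later `.str.strip()` calls would be redundant, though keeping them does no harm.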

In [3]:
df = pd.read_csv("/kaggle/input/genre-classification-dataset-imdb/Genre Classification Dataset/train_data.txt", sep=":::", header=None, engine='python', names=col_names)
df.drop("ID", axis=1, inplace=True)
df.head()

Unnamed: 0,TITLE,GENRE,DESCRIPTION
0,Oscar et la dame rose (2009),drama,Listening in to a conversation between his do...
1,Cupid (1997),thriller,A brother and sister with a past incestuous r...
2,"Young, Wild and Wonderful (1980)",adult,As the bus empties the students for their fie...
3,The Secret Sin (1915),drama,To help their unemployed father make ends mee...
4,The Unrecovered (2007),drama,The film's title refers not only to the un-re...


In [4]:
# Lower-case the title, drop the trailing " (YYYY)", and strip punctuation
def clean_title(title):
    name = title.lower().split('/')[0].strip()[:-7]
    for ch in '."\'$?:[],!':
        name = name.replace(ch, "")
    return name.replace("&", " and ").replace("-", " to ")

movie = [clean_title(x) for x in df['TITLE']]
movie[:5] # 5 sample output

['oscar et la dame rose',
 'cupid',
 'young wild and wonderful',
 'the secret sin',
 'the unrecovered']

In [5]:
year = [x.strip()[-5:-1] for x in df["TITLE"]]
year = [int(x) if x.isdigit() else None for x in year]
year[:5] # Sample 5 outputs

[2009, 1997, 1980, 1915, 2007]
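The slicing above (`[:-7]` for the name, `[-5:-1]` for the year) assumes every title ends in exactly ` (YYYY)`. A hedged alternative is a regular expression that returns `None` for the year when the pattern is absent. The helper name `split_title` and the optional `/suffix` inside the parentheses are assumptions made for illustration, not something confirmed by the dataset:

```python
import re

# Matches "<name> (<4-digit year>)"; a suffix such as "/I" after the year is tolerated
TITLE_RE = re.compile(r"^(?P<name>.*?)\s*\((?P<year>\d{4})(?:/\w+)?\)")

def split_title(title):
    """Return (name, year); year is None when no "(YYYY)" is found."""
    title = title.strip()
    match = TITLE_RE.match(title)
    if match is None:
        return title, None
    return match.group("name"), int(match.group("year"))

print(split_title("Cupid (1997)"))       # ('Cupid', 1997)
print(split_title("Untitled Project"))   # ('Untitled Project', None)
```

Unlike fixed-width slicing, this leaves a title without a year intact instead of cutting off its last seven characters.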

In [6]:
# Separating "TITLE" into "MOVIE" and "YEAR"
df['MOVIE'] = movie
df['YEAR'] = year

# Removing the attribute "TITLE" 
df.drop("TITLE", axis=1, inplace=True)

# Fill missing years with the median year (median() skips missing values by default)
df["YEAR"] = df["YEAR"].fillna(df["YEAR"].median()).astype(np.int32)
df.head()

Unnamed: 0,GENRE,DESCRIPTION,MOVIE,YEAR
0,drama,Listening in to a conversation between his do...,oscar et la dame rose,2009
1,thriller,A brother and sister with a past incestuous r...,cupid,1997
2,adult,As the bus empties the students for their fie...,young wild and wonderful,1980
3,drama,To help their unemployed father make ends mee...,the secret sin,1915
4,drama,The film's title refers not only to the un-re...,the unrecovered,2007


In [7]:
# Number of null entries in the GENRE column
df['GENRE'].isnull().sum()

0

In [8]:
df["GENRE"] = df["GENRE"].str.lower().str.strip()

In [9]:
# Number of null entries in the MOVIE column
df["MOVIE"].isnull().sum()

0

In [10]:
df["MOVIE"] = df["MOVIE"].str.strip()

In [11]:
# Number of null entries in the DESCRIPTION column
df["DESCRIPTION"].isnull().sum()

0

In [12]:
df["DESCRIPTION"] = df["DESCRIPTION"].str.lower()

In [13]:
df["DESCRIPTION"].iloc[0]

' listening in to a conversation between his doctor and parents, 10-year-old oscar learns what nobody has the courage to tell him. he only has a few weeks to live. furious, he refuses to speak to anyone except straight-talking rose, the lady in pink he meets on the hospital stairs. as christmas approaches, rose uses her fantastical experiences as a professional wrestler, her imagination, wit and charm to allow oscar to live life and love to the full, in the company of his friends pop corn, einstein, bacon and childhood sweetheart peggy blue.'

In [14]:
df["DESCRIPTION"] = [x.replace('-'," ").replace("."," ").replace(","," ").strip() for x in df["DESCRIPTION"]]

## Preprocessing

Even though we put considerable effort into preprocessing the movie name, that step is optional. <br>
A title alone is usually a weak signal for genre, and the release year is unlikely to say much about genre either, so YEAR is not used as a feature below. <br>
Dropping both columns would therefore also be a reasonable choice. But since this is a practice problem, we explore as much preprocessing as possible.

Here let's consider the input as a string concatenating both movie and description.

In [15]:
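# Remove stop words from the combined text and set up the TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words = 'english')
X = df["MOVIE"] + " " + df["DESCRIPTION"]
# Build the stop-word set once; reloading the list for every word is very slow
stop_words = set(stopwords.words("english"))
X_processed = []
for sent in X:
    X_processed.append(' '.join([x for x in sent.split() if x not in stop_words]))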
X_processed[:5]

['oscar et la dame rose listening conversation doctor parents 10 year old oscar learns nobody courage tell weeks live furious refuses speak anyone except straight talking rose lady pink meets hospital stairs christmas approaches rose uses fantastical experiences professional wrestler imagination wit charm allow oscar live life love full company friends pop corn einstein bacon childhood sweetheart peggy blue',
 'cupid brother sister past incestuous relationship current murderous relationship murders women reject murders women get close',
 'young wild wonderful bus empties students field trip museum natural history little tour guide suspect students another tour first lecture films coeds drift dreams erotic fantasies one imagine films release emotion fantasies erotic uncommon ways one slips curator\'s office little "acquisition " another finds anthropologist see bones identified even head teacher immune soon tour bus departs everyone admits quite education',
 "secret sin help unemployed 

In [16]:
X_processed = vectorizer.fit_transform(X_processed)

X_train, X_test, y_train, y_test = train_test_split(X_processed, df["GENRE"], test_size=0.2, random_state=42)

In [17]:
X_train

<43371x137058 sparse matrix of type '<class 'numpy.float64'>'
	with 2015978 stored elements in Compressed Sparse Row format>
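The output above reports one row per training example (43371) and one column per vocabulary term (137058), stored sparsely because most terms do not occur in any given description. A toy sketch with three made-up documents shows the same structure on a small scale:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up documents standing in for "movie + description" strings
docs = [
    "detective hunts killer city",
    "family comedy road trip",
    "detective comedy city",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(docs)

print(toy_matrix.shape)                    # (number of documents, vocabulary size)
print(sorted(toy_vectorizer.vocabulary_))  # the terms behind each column
print(toy_matrix.nnz)                      # number of stored (non-zero) weights
```

Each row is L2-normalised by default, so long and short descriptions end up on a comparable scale.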

In [18]:
# Naive Bayes
clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.4444341971779028
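An accuracy figure is easier to interpret next to a trivial reference point. One possible sketch is the accuracy of always predicting the most frequent training label. The helper `majority_baseline_accuracy` and the toy labels are hypothetical and not part of the original notebook:

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)

# Made-up labels for illustration only
toy_train = ["drama", "drama", "comedy", "drama", "horror"]
toy_test = ["drama", "comedy", "drama", "horror"]
print(majority_baseline_accuracy(toy_train, toy_test))  # 0.5
```

Calling `majority_baseline_accuracy(y_train, y_test)` on the split above would give the corresponding reference value for this dataset.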

In [19]:
# Support Vector Machine
svm = SVC()
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
accuracy_score(y_test, y_pred)

0.5663561744904547
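`SVC()` uses an RBF kernel by default, and its training time grows at least quadratically with the number of samples, so fitting it on roughly 43k rows can be slow. For sparse TF-IDF features, a linear SVM is a common alternative. The sketch below swaps in `LinearSVC` on made-up data; it is not the model used in this notebook and makes no claim about the accuracy it would reach here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up texts and genres for illustration only
toy_texts = [
    "detective hunts killer city",
    "police chase murder suspect",
    "family road trip laughs",
    "funny wedding party chaos",
]
toy_genres = ["thriller", "thriller", "comedy", "comedy"]

toy_X = TfidfVectorizer().fit_transform(toy_texts)

linear_svm = LinearSVC()          # linear kernel; scales better with sample count
linear_svm.fit(toy_X, toy_genres)
toy_pred = linear_svm.predict(toy_X)
print(toy_pred)
```

In the notebook, `LinearSVC().fit(X_train, y_train)` would be the drop-in replacement for the `SVC()` cell.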

In [20]:
# Logistic Regression
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
accuracy_score(y_test, y_pred)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.5812966891081804
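The warning above means the solver stopped at its iteration limit (`max_iter` defaults to 100) before converging. As the message suggests, one option is to raise that limit. The value 1000 below is an arbitrary assumption, and the data is made up; whether this changes the accuracy reported above is not something this sketch shows:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up texts and genres for illustration only
lr_texts = [
    "detective hunts killer city",
    "police chase murder suspect",
    "family road trip laughs",
    "funny wedding party chaos",
]
lr_genres = ["thriller", "thriller", "comedy", "comedy"]

lr_X = TfidfVectorizer().fit_transform(lr_texts)

lr_more_iter = LogisticRegression(max_iter=1000)  # default limit is 100
lr_more_iter.fit(lr_X, lr_genres)

# n_iter_ holds the iterations actually used; a value below max_iter means it converged
print(lr_more_iter.n_iter_)
```

In the notebook this corresponds to `LogisticRegression(max_iter=1000).fit(X_train, y_train)`; trying another solver is the other route the warning mentions.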