# Can a movie's plot summary determine its MPAA rating?


Using a dataset containing a random sample of roughly 1,000 movies (979 rows) that lists each film's IMDb star rating, MPAA rating, genre, and duration, I used Selenium to scrape the opening paragraph of each movie's Wikipedia page. I wanted to use IMDb, but it identifies each movie by a unique ID, and I couldn't immediately get the API to work and match titles to IDs. I then added these descriptions to the dataframe as a new column.

I then built a classifier to determine whether certain words in the descriptions lend themselves to certain MPAA ratings, framed as a binary task: R-rated or not. While crime and horror movies are generally more likely to be R-rated, I wondered whether you could tell that purely from their descriptions. The classifiers I used, Linear SVC and Random Forest, both scored poorly: test accuracy was about 0.59 and 0.51 respectively, close to what always guessing the more common class would achieve. I believe that's not because there is no relation between the text and the ratings, but because of flaws in the Wikipedia descriptions. The opening paragraph generally includes words like "directed" and "written", which occur very frequently and are irrelevant to the classification, yet the classifiers list them among the most important features. In general, the features were a bit random: they had little to do with the movie's plot and even less to do with its rating. I was not able to add my own stop words (I tried an nltk package and a manual solution from Stack Overflow, without success). If I repeated this, I would use better data from IMDb and/or a more thorough stop-word list.

 

### Classification Process

I began with about 900 pieces of text scraped from Wikipedia (891 after dropping the pages that could not be scraped). Since the original dataset already included the MPAA ratings, I did not need to label anything manually. I then tried several of the classifiers available in scikit-learn to find the 'best' results, judged by accuracy score and confusion matrix. Finally, I looked at the most important features.




## Scraping movie descriptions

In [2]:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd



In [2]:
df = pd.read_csv('imdb_1000.csv')

In [3]:
df.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


In [4]:
df.shape

(979, 6)

In [5]:
# Titles whose plain Wikipedia URL does not lead to the film's article
url_overrides = {
    'City_of_God': 'City_of_God_(2002_film)',
    'Whiplash': 'Whiplash_(2013_film)',
}

movie_urls = []
for eachname in df.title:
    eachname = eachname.replace(" ", "_")
    eachname = url_overrides.get(eachname, eachname)
    movie_urls.append(f"https://en.wikipedia.org/wiki/{eachname}")

movie_urls

['https://en.wikipedia.org/wiki/The_Shawshank_Redemption',
 'https://en.wikipedia.org/wiki/The_Godfather',
 'https://en.wikipedia.org/wiki/The_Godfather:_Part_II',
 'https://en.wikipedia.org/wiki/The_Dark_Knight',
 'https://en.wikipedia.org/wiki/Pulp_Fiction',
 'https://en.wikipedia.org/wiki/12_Angry_Men',
 'https://en.wikipedia.org/wiki/The_Good,_the_Bad_and_the_Ugly',
 'https://en.wikipedia.org/wiki/The_Lord_of_the_Rings:_The_Return_of_the_King',
 "https://en.wikipedia.org/wiki/Schindler's_List",
 'https://en.wikipedia.org/wiki/Fight_Club',
 'https://en.wikipedia.org/wiki/The_Lord_of_the_Rings:_The_Fellowship_of_the_Ring',
 'https://en.wikipedia.org/wiki/Inception',
 'https://en.wikipedia.org/wiki/Star_Wars:_Episode_V_-_The_Empire_Strikes_Back',
 'https://en.wikipedia.org/wiki/Forrest_Gump',
 'https://en.wikipedia.org/wiki/The_Lord_of_the_Rings:_The_Two_Towers',
 'https://en.wikipedia.org/wiki/Interstellar',
 "https://en.wikipedia.org/wiki/One_Flew_Over_the_Cuckoo's_Nest",
 'https://

In [6]:
!pip install webdriver-manager



In [3]:
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
# driver = webdriver.Chrome()

[WDM] - Current google-chrome version is 89.0.4389
[WDM] - Get LATEST driver version for 89.0.4389
[WDM] - Trying to download new driver from https://chromedriver.storage.googleapis.com/89.0.4389.23/chromedriver_mac64.zip
[WDM] - Driver has been saved in cache [/Users/paromasoni/.wdm/drivers/chromedriver/mac64/89.0.4389.23]


In [8]:
driver.get('https://en.wikipedia.org/wiki/The_Shawshank_Redemption')

desc = driver.find_element_by_xpath("/html/body/div[3]/div[3]/div[5]/div[1]/p[2]").text

In [9]:
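import time

# Absolute XPath to the opening paragraph; brittle, since it depends on Wikipedia's page layout.
# (Selenium 4 replaces find_element_by_xpath with driver.find_element(By.XPATH, ...).)
DESC_XPATH = "/html/body/div[3]/div[3]/div[5]/div[1]/p[2]"

all_descs = []

for url in movie_urls:
    driver.get(url)

    try:
        desc = driver.find_element_by_xpath(DESC_XPATH).text
    except Exception:
        # Retry with the "_(film)" suffix that Wikipedia uses for ambiguous titles
        try:
            driver.get(url + "_(film)")
            desc = driver.find_element_by_xpath(DESC_XPATH).text
        except Exception:
            desc = 'na'  # placeholder; these rows are dropped later

    all_descs.append(desc)

    time.sleep(0.5)  # pause between requests

all_descs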

['The Shawshank Redemption is a 1994 American drama film written and directed by Frank Darabont, based on the 1982 Stephen King novella Rita Hayworth and Shawshank Redemption. It tells the story of banker Andy Dufresne (Tim Robbins), who is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover, despite his claims of innocence. Over the following two decades, he befriends a fellow prisoner, contraband smuggler Ellis "Red" Redding (Morgan Freeman), and becomes instrumental in a money-laundering operation led by the prison warden Samuel Norton (Bob Gunton). William Sadler, Clancy Brown, Gil Bellows, and James Whitmore appear in supporting roles.',
 "The Godfather is a 1972 American crime film directed by Francis Ford Coppola, who co-wrote the screenplay with Mario Puzo, based on Puzo's best-selling 1969 novel of the same name. The film stars Marlon Brando, Al Pacino, James Caan, Richard Castellano, Robert Duvall, Sterling Hayden, John Marley, Richard 


In [11]:
df['desc'] = all_descs
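
The scraping loop above relies on an absolute XPath and falls back to 'na' when the paragraph is not where it expects. As a minimal sketch of an alternative, assuming Wikipedia's REST `page/summary` endpoint and its `extract` field suit this purpose, the lead text could be requested directly. The helper names `summary_url` and `fetch_summary` and the User-Agent string are hypothetical and not part of the original notebook.

```python
from urllib.parse import quote

import requests

# Assumed endpoint: the page summary route of Wikipedia's REST API.
SUMMARY_API = "https://en.wikipedia.org/api/rest_v1/page/summary/"


def summary_url(title: str) -> str:
    """Build the summary URL for a film title (hypothetical helper)."""
    # Wikipedia page names use underscores; quote() escapes everything else.
    return SUMMARY_API + quote(title.replace(" ", "_"), safe="")


def fetch_summary(title: str, timeout: float = 10.0) -> str:
    """Return the article's lead extract, or 'na' if the lookup fails."""
    try:
        response = requests.get(
            summary_url(title),
            headers={"User-Agent": "mpaa-rating-notebook (example)"},
            timeout=timeout,
        )
        response.raise_for_status()
        return response.json().get("extract", "na")
    except (requests.RequestException, ValueError):
        return "na"
```

With this sketch, `all_descs = [fetch_summary(t) for t in df.title]` would stand in for the Selenium loop. Ambiguous titles would still need a suffix such as `_(film)`, as in the loop above.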

In [12]:
df

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list,desc
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...",The Shawshank Redemption is a 1994 American dr...
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']",The Godfather is a 1972 American crime film di...
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv...",The Godfather Part II is a 1974 American epic ...
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E...",The Dark Knight is a 2008 superhero film direc...
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L....",Pulp Fiction is a 1994 American neo-noir black...
...,...,...,...,...,...,...,...
974,7.4,Tootsie,PG,Comedy,116,"[u'Dustin Hoffman', u'Jessica Lange', u'Teri G...",Tootsie was a major critical and financial suc...
975,7.4,Back to the Future Part III,PG,Adventure,118,"[u'Michael J. Fox', u'Christopher Lloyd', u'Ma...",Back to the Future Part III is a 1990 American...
976,7.4,Master and Commander: The Far Side of the World,PG-13,Action,138,"[u'Russell Crowe', u'Paul Bettany', u'Billy Bo...",Master and Commander: The Far Side of the Worl...
977,7.4,Poltergeist,PG,Horror,114,"[u'JoBeth Williams', u""Heather O'Rourke"", u'Cr...",They have traditionally been described as trou...


In [14]:
df[df.desc == 'na']

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list,desc
39,8.6,Psycho,R,Horror,109,"[u'Anthony Perkins', u'Janet Leigh', u'Vera Mi...",na
65,8.4,Witness for the Prosecution,APPROVED,Crime,116,"[u'Tyrone Power', u'Marlene Dietrich', u'Charl...",na
88,8.4,The Kid,NOT RATED,Comedy,68,"[u'Charles Chaplin', u'Edna Purviance', u'Jack...",na
115,8.3,Scarface,R,Crime,170,"[u'Al Pacino', u'Michelle Pfeiffer', u'Steven ...",na
123,8.3,The General,UNRATED,Action,107,"[u'Buster Keaton', u'Marion Mack', u'Glen Cave...",na
...,...,...,...,...,...,...,...
916,7.5,Up in the Air,R,Drama,109,"[u'George Clooney', u'Vera Farmiga', u'Anna Ke...",na
918,7.5,Running Scared,R,Action,122,"[u'Paul Walker', u'Cameron Bright', u'Chazz Pa...",na
936,7.4,True Grit,,Adventure,128,"[u'John Wayne', u'Kim Darby', u'Glen Campbell']",na
950,7.4,Bound,R,Crime,108,"[u'Jennifer Tilly', u'Gina Gershon', u'Joe Pan...",na


In [16]:
clean_df = df[df.desc != 'na'].copy()  # copy so later column assignments don't act on a view of df

In [17]:
clean_df.shape

(891, 7)

In [18]:
clean_df.content_rating.isna().value_counts()

False    889
True       2
Name: content_rating, dtype: int64

In [24]:
# clean_df.to_csv('clean_df.csv')

In [13]:
clean_df = pd.read_csv('clean_df.csv')

In [20]:
clean_df.desc = clean_df.desc.fillna('')

In [21]:
clean_df.desc.isna().value_counts()

False    891
Name: desc, dtype: int64

## Building the classifier

In [None]:
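# Not run: limiting tokens to English letters only (this would also drop year tokens such as "2001")
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(token_pattern=r"\b[A-Za-z']+\b")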

In [74]:
from sklearn.feature_extraction.text import TfidfVectorizer

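# Make a TF-IDF vectorizer: drop English stop words and keep only words found in at least 75 descriptions
vectorizer = TfidfVectorizer(stop_words='english', min_df=75)


# Learn the vocabulary and compute TF-IDF weights for clean_df.desc
matrix = vectorizer.fit_transform(clean_df.desc)
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())  # get_feature_names() in scikit-learn < 1.0
words_df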

Unnamed: 0,2001,2003,2004,2005,2006,2007,2009,2010,2013,2014,...,woman,won,world,writer,written,wrote,year,years,york,young
0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.146934,0.00000,0.0,0.0,0.000000,0.0
1,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.24367,0.0,0.0,0.000000,0.0
2,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.082371,0.00000,0.0,0.0,0.178234,0.0
3,0.0,0.000000,0.0,0.253215,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.110272,0.00000,0.0,0.0,0.000000,0.0
4,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.142004,0.00000,0.0,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.00000,0.0,0.0,0.000000,0.0
887,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.00000,0.0,0.0,0.000000,0.0
888,0.0,0.260762,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.219301,0.0,0.117157,0.00000,0.0,0.0,0.000000,0.0
889,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.00000,0.0,0.0,0.000000,0.0
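
The vocabulary above is dominated by years and production words such as "written" and "wrote". A minimal sketch of one way to extend the built-in English stop list, assuming scikit-learn's `ENGLISH_STOP_WORDS` set; the `custom_stop_words` list is a hypothetical selection drawn from the feature tables in this notebook, and the two toy descriptions are invented for illustration only.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Hypothetical extra stop words: production terms seen in the feature lists.
custom_stop_words = ["directed", "written", "wrote", "produced", "film",
                     "american", "director", "stars", "starring"]

# stop_words accepts a list, so merge the built-in set with the extras.
all_stop_words = sorted(ENGLISH_STOP_WORDS.union(custom_stop_words))

# Letters-only tokens also drop years such as "2003" from the vocabulary.
vectorizer = TfidfVectorizer(stop_words=all_stop_words,
                             token_pattern=r"(?u)\b[A-Za-z]{2,}\b")

# Toy descriptions, invented for illustration.
toy_descs = [
    "A 1994 American drama film written and directed by someone.",
    "A 2003 horror thriller film about a haunted prison.",
]
toy_matrix = vectorizer.fit_transform(toy_descs)
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
```

In the notebook itself, the same `stop_words` and `token_pattern` arguments could be combined with `min_df=75` and fitted on `clean_df.desc`.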


In [73]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
clean_df['rating_label'] = le.fit_transform(clean_df.content_rating)
clean_df.head()

Unnamed: 0.1,Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list,desc,rating_label,is_R_rated
0,0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...",The Shawshank Redemption is a 1994 American dr...,8,1
1,1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']",The Godfather is a 1972 American crime film di...,8,1
2,2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv...",The Godfather Part II is a 1974 American epic ...,8,1
3,3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E...",The Dark Knight is a 2008 superhero film direc...,7,0
4,4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L....",Pulp Fiction is a 1994 American neo-noir black...,8,1


In [24]:
# X = words_df
# y = clean_df.rating_label

In [30]:
# Binary target: 1 if the film is R rated, 0 otherwise
clean_df['is_R_rated'] = (clean_df.content_rating == 'R').astype(int)

In [31]:
X = words_df
y = clean_df.is_R_rated

In [32]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

### Linear SVC

In [66]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [67]:
from sklearn.svm import LinearSVC 
clf = LinearSVC(max_iter=100000)
clf.fit(X_train, y_train)

LinearSVC(max_iter=100000)

In [76]:
# from sklearn.metrics import confusion_matrix

# y_true = y_test
# y_pred = clf.predict(X_test)
# matrix = confusion_matrix(y_true, y_pred)

# label_names = pd.Series(le.classes_)
# pd.DataFrame(matrix,
#      columns='Predicted ' + label_names,
#      index='Is ' + label_names)

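# This raised "ValueError: Shape of passed values is (10, 10), indices imply (13, 13)",
# and the first shape kept changing between runs: sometimes (2, 2), sometimes (12, 12).
# Likely cause: confusion_matrix only builds rows and columns for classes that appear in
# y_true or y_pred, while le.classes_ lists all 13 ratings; passing labels= would align them.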
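
The shape error described in the comments is consistent with `confusion_matrix` building rows only for classes that occur in `y_true` or `y_pred`. A minimal sketch of the `labels=` fix, using invented toy ratings rather than the notebook's data:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix

# Toy ratings, invented for illustration; the encoder learns four classes.
ratings = ["G", "PG", "PG-13", "R", "R", "PG", "R", "PG-13"]
le = preprocessing.LabelEncoder()
encoded = le.fit_transform(ratings)

# Pretend the test split only contains two of the four classes.
y_true = np.array([3, 3, 1, 1])   # R, R, PG, PG
y_pred = np.array([3, 1, 1, 3])

# Without labels=, the matrix would be 2x2 and would not match le.classes_.
all_labels = np.arange(len(le.classes_))
matrix = confusion_matrix(y_true, y_pred, labels=all_labels)

label_names = pd.Series(le.classes_)
result = pd.DataFrame(matrix,
                      columns="Predicted " + label_names,
                      index="Is " + label_names)
print(result)
```

Classes missing from the test split then show up as all-zero rows and columns instead of changing the shape of the matrix.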


In [69]:
clf.fit(X_train, y_train)

LinearSVC(max_iter=100000)

In [70]:
clf.score(X_test, y_test)

0.5874439461883408

In [71]:
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not R rated', 'R rated'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)


Unnamed: 0,Predicted not R rated,Predicted R rated
Is not R rated,77,49
Is R rated,43,54


In [72]:
import eli5

feature_names = list(words_df.columns)
eli5.show_weights(clf, feature_names=feature_names)

Weight?,Feature
+2.147,director
+1.546,horror
+1.307,york
+1.265,2006
+1.189,actor
+1.169,thriller
+1.127,death
+1.078,awards
+1.012,2003
+1.009,david


### Random forest

In [59]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [60]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10)
clf.fit(X_train, y_train)

RandomForestClassifier(n_estimators=10)

In [61]:
clf.score(X_test, y_test)

0.5112107623318386

In [62]:
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not R rated', 'R rated'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)


Unnamed: 0,Predicted not R rated,Predicted R rated
Is not R rated,76,29
Is R rated,80,38


In [63]:
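# These are terrible results: 0.51 accuracy, below the 0.53 that always predicting
# 'R rated' would score on this test split (118 of 223 films)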
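
Each classifier above was scored on its own random split, so the two accuracy figures are not directly comparable and neither is set against a baseline. A minimal sketch of a fairer comparison, assuming stratified 5-fold cross-validation and scikit-learn's `DummyClassifier` as a majority-class baseline; `compare_models` is a hypothetical helper, the stand-in data is random, and the forest uses 100 trees rather than the 10 used above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC


def compare_models(X, y, seed=0):
    """Score each model on the same stratified folds (hypothetical helper)."""
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    models = {
        "majority baseline": DummyClassifier(strategy="most_frequent"),
        "linear SVC": LinearSVC(max_iter=100000),
        "random forest": RandomForestClassifier(n_estimators=100,
                                                random_state=seed),
    }
    # Mean accuracy across folds for every model.
    return {name: cross_val_score(model, X, y, cv=folds).mean()
            for name, model in models.items()}


# Stand-in data, invented for illustration; in the notebook this would be
# compare_models(words_df, clean_df.is_R_rated).
rng = np.random.default_rng(0)
X_demo = rng.random((200, 20))
y_demo = (rng.random(200) < 0.45).astype(int)
scores = compare_models(X_demo, y_demo)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

A model is only worth interpreting if it clearly beats the majority baseline on the same folds.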

In [64]:
import eli5

feature_names = list(words_df.columns)
eli5.show_weights(clf, feature_names=feature_names)

Weight,Feature
0.0305  ± 0.0266,directed
0.0226  ± 0.0289,thriller
0.0223  ± 0.0186,written
0.0196  ± 0.0326,produced
0.0181  ± 0.0268,american
0.0174  ± 0.0268,story
0.0168  ± 0.0157,black
0.0166  ± 0.0244,film
0.0158  ± 0.0244,comedy
0.0140  ± 0.0161,novel
