
## The fifth In-class-exercise (9/30/2020, 20 points in total)

In exercise-03, I asked you to collect 500 texts based on your own information needs (if you didn't collect them, you should collect them again for this exercise). Now we need to think about how to represent this textual data for text classification. In this exercise, you are required to select 10 types of features from the following feature list (10 feature types, which will yield many more than 10 individual features), then represent the 500 texts with these features. The output should be in the following format:
![image.png](attachment:image.png)

The feature list:

* (1) tf-idf features
* (2) POS-tag features: number of adjectives, adverbs, auxiliaries, punctuation marks, complementizers, coordinating conjunctions, subordinating conjunctions, determiners, interjections, nouns, possessors, prepositions, pronouns, quantifiers, verbs, and others. (Select some of them if you use POS-tag features.)
* (3) Linguistic features:
  * number of right-branching nodes across all constituent types
  * number of right-branching nodes for NPs only
  * number of left-branching nodes across all constituent types
  * number of left-branching nodes for NPs only
  * number of premodifiers across all constituent types
  * number of premodifiers within NPs only
  * number of postmodifiers across all constituent types
  * number of postmodifiers within NPs only
  * branching index across all constituent types, i.e. the number of right-branching nodes minus number of left-branching nodes
  * branching index for NPs only
  * branching weight index: number of tokens covered by right-branching nodes minus number of tokens covered by left-branching nodes across all categories
  * branching weight index for NPs only 
  * modification index, i.e. the number of premodifiers minus the number of postmodifiers across all categories
  * modification index for NPs only
  * modification weight index: length in tokens of all premodifiers minus length in tokens of all postmodifiers across all categories
  * modification weight index for NPs only
  * coordination balance, i.e. the maximal length difference in coordinated constituents
  
  * density (density can be calculated as the ratio of the following function words to content words) of determiners/quantifiers
  * density of pronouns
  * density of prepositions
  * density of punctuation marks, specifically commas and semicolons
  * density of auxiliary verbs
  * density of conjunctions
  * density of different pronoun types: Wh, 1st, 2nd, and 3rd person pronouns
  
  * maximal and average NP length
  * maximal and average AJP length
  * maximal and average PP length
  * maximal and average AVP length
  * sentence length

* Other features you have in mind (i.e., pre-defined patterns)
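
The density features in the list are ratios of function-word counts to content-word counts. A minimal sketch of that ratio follows, assuming the text has already been tagged with Penn Treebank POS tags (the tag set `nltk.pos_tag` returns). The tag groupings and the helper name `density` are illustrative choices, not part of the assignment.

```python
from collections import Counter

# Assumed groupings of Penn Treebank tags; adjust them to your own definition.
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")  # nouns, verbs, adjectives, adverbs
FUNCTION_TAGS = {
    "determiners": {"DT", "PDT", "WDT"},
    "pronouns": {"PRP", "PRP$", "WP", "WP$"},
    "prepositions": {"IN"},  # note: "IN" also covers subordinating conjunctions
    "conjunctions": {"CC"},
}


def density(tags, group):
    """Return count(function words in `group`) / count(content words)."""
    counts = Counter(tags)
    function_count = sum(counts[tag] for tag in FUNCTION_TAGS[group])
    content_count = sum(n for tag, n in counts.items()
                        if tag.startswith(CONTENT_PREFIXES))
    # Texts without any content word get a density of 0.0 instead of an error.
    return function_count / content_count if content_count else 0.0


# Hand-tagged example: "the lord of the rings"
tags = ["DT", "NN", "IN", "DT", "NNS"]
print(density(tags, "determiners"))   # 2 determiners / 2 content words
print(density(tags, "prepositions"))  # 1 preposition / 2 content words
```

Applying `density` to each text for each group gives one numeric column per function-word class.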

In [1]:
# Libraries used for scraping: 'requests', 'BeautifulSoup' (bs4) and 're'.
# 'requests' sends the HTTP request for the web page to be scraped.
# 'BeautifulSoup' is a Python package for parsing HTML and XML documents; it builds a parse tree that makes data extraction easy.

# Step 1: Import the modules needed for web scraping

from bs4 import BeautifulSoup
import requests
import re

# Step 2: Download the web page to scrape.
# The server answers the request with the page's HTML (or XML).

# Download IMDb's Top Rated chart
url = 'http://www.imdb.com/chart/top'

# Request the page and store the server's reply in `response`
response = requests.get(url)

# Step 3: Parse the page

# BeautifulSoup supports several parsers, such as 'html.parser', 'html5lib' and 'lxml'.
# Here 'lxml' parses the HTML held in `response` into a BeautifulSoup object.
# prettify() indents the HTML so that its structure is easier to read.
soup = BeautifulSoup(response.text, 'lxml')
print(soup.prettify())

# Step 4: Extract the data
# Based on the HTML structure printed by prettify(), select the table cells that hold the wanted fields.
# The list comprehensions read one attribute from each matching tag.

movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.posterColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]

# Step 5: Store the collected data in the required format
# Each movie becomes a dictionary, and the dictionaries are collected in a list

imdb = []
# Store each item in a dictionary (data), then append it to the list (imdb)
for index in range(len(movies)):
    # Separate the cell text into 'place', 'title', 'year'
    movie_string = movies[index].get_text()
    movie = ' '.join(movie_string.split()).replace('.', '')
    # The rank on the page is 1-based, so its width is len(str(index + 1));
    # using len(str(index)) mis-slices ranks 10 and 100
    rank_width = len(str(index + 1))
    place = movie[:rank_width]
    movie_title = movie[rank_width + 1:-7]
    year = re.search(r'\((.*?)\)', movie_string).group(1)
    data = {"movie_title": movie_title,
            "year": year,
            "place": place,
            "star_cast": crew[index],
            "rating": ratings[index],
            "link": links[index]}
    imdb.append(data)

for item in imdb:
    print(item['place'], '-', item['movie_title'], '('+item['year']+') -', 'Starring:', item['star_cast'], '-' ,'Rating:', item['rating'], 'Link:', item['link'])

Streaming output truncated to the last 5000 lines.
                <td class="ratingColumn">
                 <div class="seen-widget seen-widget-tt0405159 pending" data-titleid="tt0405159">
                  <div class="boundary">
                   <div class="popover">
                    <span class="delete">
                    </span>
                    <ol>
                     <li>
                      1
                     </li>
                     <li>
                      2
                     </li>
                     <li>
                      3
                     </li>
                     <li>
                      4
                     </li>
                     <li>
                      5
                     </li>
                     <li>
                      6
                     </li>
                     <li>
                      7
                     </li>
                     <li>
                      8
                     </li>
  

In [2]:
print(imdb)

[{'movie_title': 'The Shawshank Redemption', 'year': '1994', 'place': '1', 'star_cast': 'Frank Darabont (dir.), Tim Robbins, Morgan Freeman', 'rating': '9.222725405015765', 'link': '/title/tt0111161/'}, {'movie_title': 'The Godfather', 'year': '1972', 'place': '2', 'star_cast': 'Francis Ford Coppola (dir.), Marlon Brando, Al Pacino', 'rating': '9.14887070728633', 'link': '/title/tt0068646/'}, {'movie_title': 'The Godfather: Part II', 'year': '1974', 'place': '3', 'star_cast': 'Francis Ford Coppola (dir.), Al Pacino, Robert De Niro', 'rating': '8.98120498927575', 'link': '/title/tt0071562/'}, {'movie_title': 'The Dark Knight', 'year': '2008', 'place': '4', 'star_cast': 'Christopher Nolan (dir.), Christian Bale, Heath Ledger', 'rating': '8.973397282092545', 'link': '/title/tt0468569/'}, {'movie_title': '12 Angry Men', 'year': '1957', 'place': '5', 'star_cast': 'Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb', 'rating': '8.930463998252359', 'link': '/title/tt0050083/'}, {'movie_title': "Sc
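
Splitting a title cell into rank, title and year can also be done with one regular expression, which avoids depending on how many digits the rank has. The sketch below assumes the cell text has the layout `<rank>. <title> (<year>)` once whitespace is collapsed. The helper name `parse_title_cell` is hypothetical.

```python
import re

# Assumed cell layout after whitespace normalisation: "<rank>. <title> (<year>)"
CELL_PATTERN = re.compile(r"^(\d+)\.\s*(.+?)\s*\((\d{4})\)$")


def parse_title_cell(cell_text):
    """Split the text of a title cell into (place, movie_title, year)."""
    normalised = " ".join(cell_text.split())
    match = CELL_PATTERN.match(normalised)
    if match is None:
        raise ValueError(f"unexpected cell layout: {normalised!r}")
    place, movie_title, year = match.groups()
    return place, movie_title, year


print(parse_title_cell("\n 1.\n  The Shawshank Redemption\n (1994)\n"))
print(parse_title_cell("246. The Terminator (1984)"))
```

Unlike the slicing approach, this keeps periods that belong to a title and raises an error when a cell does not match the expected layout.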

In [3]:
# Finally, save the scraped data to a CSV file.
# The file 'IMDBs_Top_RatedData.csv' stores all the fields for later use.

import csv
filename = 'IMDBs_Top_RatedData.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['movie_title', 'year', 'star_cast', 'rating', 'link', 'place'])
    w.writeheader()
    for item in imdb:
        w.writerow(item)

In [4]:
import pandas as pd
  
# reading the CSV file 
csvFile = pd.read_csv('IMDBs_Top_RatedData.csv') 
  
# displaying the contents of the CSV file 
print(csvFile) 

                   movie_title  year  ...               link  place
0     The Shawshank Redemption  1994  ...  /title/tt0111161/      1
1                The Godfather  1972  ...  /title/tt0068646/      2
2       The Godfather: Part II  1974  ...  /title/tt0071562/      3
3              The Dark Knight  2008  ...  /title/tt0468569/      4
4                 12 Angry Men  1957  ...  /title/tt0050083/      5
..                         ...   ...  ...                ...    ...
245             The Terminator  1984  ...  /title/tt0088247/    246
246                 Tangerines  2013  ...  /title/tt2991224/    247
247                    Aladdin  1992  ...  /title/tt0103639/    248
248  A Silent Voice: The Movie  2016  ...  /title/tt5323662/    249
249            Munna Bhai MBBS  2003  ...  /title/tt0374887/    250

[250 rows x 6 columns]


In [5]:
dat = pd.read_csv('IMDBs_Top_RatedData.csv')
print(dat.shape)
dat.head(5)

(250, 6)


Unnamed: 0,movie_title,year,star_cast,rating,link,place
0,The Shawshank Redemption,1994,"Frank Darabont (dir.), Tim Robbins, Morgan Fre...",9.222725,/title/tt0111161/,1
1,The Godfather,1972,"Francis Ford Coppola (dir.), Marlon Brando, Al...",9.148871,/title/tt0068646/,2
2,The Godfather: Part II,1974,"Francis Ford Coppola (dir.), Al Pacino, Robert...",8.981205,/title/tt0071562/,3
3,The Dark Knight,2008,"Christopher Nolan (dir.), Christian Bale, Heat...",8.973397,/title/tt0468569/,4
4,12 Angry Men,1957,"Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb",8.930464,/title/tt0050083/,5


In [6]:
from sklearn.feature_extraction.text import CountVectorizer

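vectorizer = CountVectorizer()
# tokenize and build vocab
# NOTE: iterating over a DataFrame yields its column labels, so fitting on `dat`
# learns the six header strings, not the texts. Pass a text column such as
# dat['movie_title'] to vectorize the titles themselves.
vectorizer.fit(dat)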
# summarize
print(vectorizer.vocabulary_)
# encode document
vector = vectorizer.transform(dat)
# summarize encoded vector
print(vector.toarray())

{'movie_title': 1, 'year': 5, 'star_cast': 4, 'rating': 3, 'link': 0, 'place': 2}
[[0 1 0 0 0 0]
 [0 0 0 0 0 1]
 [0 0 0 0 1 0]
 [0 0 0 1 0 0]
 [1 0 0 0 0 0]
 [0 0 1 0 0 0]]
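
The vocabulary above holds the six column names because iterating over a DataFrame yields its column labels, so each header string was treated as one document. The sketch below re-creates that output in plain Python. It relies on two `CountVectorizer` defaults as I understand them: tokens are lowercased runs of two or more word characters, and vocabulary indices follow alphabetical order. The helper name `bag_of_words` is illustrative.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # two or more word characters


def bag_of_words(documents):
    """Return (vocabulary, count matrix) for a list of strings."""
    tokenised = [TOKEN_PATTERN.findall(doc.lower()) for doc in documents]
    terms = sorted({token for tokens in tokenised for token in tokens})
    vocabulary = {term: index for index, term in enumerate(terms)}
    matrix = []
    for tokens in tokenised:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in terms])
    return vocabulary, matrix


# The "documents" the vectorizer actually saw: the DataFrame's column labels.
columns = ["movie_title", "year", "star_cast", "rating", "link", "place"]
vocabulary, matrix = bag_of_words(columns)
print(vocabulary)
for row in matrix:
    print(row)
```

Passing a text column, for example `dat['movie_title']`, makes each title a document instead.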


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
# NOTE: `dat` is the whole DataFrame, so the "documents" here are its column labels
vectorizer.fit(dat)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform(dat)
# summarize encoded vector

print(vector.toarray())

{'movie_title': 1, 'year': 5, 'star_cast': 4, 'rating': 3, 'link': 0, 'place': 2}
[2.25276297 2.25276297 2.25276297 2.25276297 2.25276297 2.25276297]
[[0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]]
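
Every term in the output above has the same idf because each of the six header strings occurs in exactly one "document". With scikit-learn's default smoothing, idf is computed as ln((1 + n) / (1 + df)) + 1. The following sketch checks the printed value by hand and shows why each row of the matrix contains a single 1.0 (one term per row, followed by L2 normalisation). The function name `smoothed_idf` is illustrative.

```python
import math


def smoothed_idf(n_documents, document_frequency):
    """idf with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1


# Six "documents", each term present in exactly one of them.
idf = smoothed_idf(n_documents=6, document_frequency=1)
print(round(idf, 8))

# A row with a single term: tf * idf, then division by the row's L2 norm.
weight = 1 * idf
print(weight / math.sqrt(weight ** 2))
```

On the real titles the idf values differ from term to term, since common words such as "the" appear in many documents.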


In [8]:

from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=50)
# encode document
# NOTE: `dat` is iterated by column label, so six header strings are hashed (shape (6, 50))
vector = vectorizer.transform(dat)
# summarize encoded vector
print(vector.shape)
print(vector.toarray())

(6, 50)
[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0. -1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0. -1.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0

In [9]:
# The csv module expects files opened with newline=""
file = open("IMDBs_Top_RatedData.csv", newline="")
reader = csv.reader(file)

In [10]:
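# Tokenize the star_cast column (index 2): lowercase, split on whitespace,
# and strip commas, periods and parentheses from each word
data = [
    [(word.replace(",", "")
          .replace(".", "")
          .replace("(", "")
          .replace(")", ""))
    for word in row[2].lower().split()]
    for row in reader]

# Remove the header row
data = data[1:]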
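
Reading the column by position (`row[2]`) works only while `star_cast` stays the third field. One possible alternative is to read rows by header name with `csv.DictReader` and strip the punctuation with a single regular expression. The in-memory CSV and the helper name `tokenize_cast` below are stand-ins for illustration.

```python
import csv
import io
import re


def tokenize_cast(text):
    """Lowercase, drop commas, periods and parentheses, split on whitespace."""
    return re.sub(r"[,.()]", "", text.lower()).split()


# In-memory stand-in for IMDBs_Top_RatedData.csv (two columns only).
sample_csv = io.StringIO(
    "movie_title,star_cast\n"
    'The Shawshank Redemption,"Frank Darabont (dir.), Tim Robbins, Morgan Freeman"\n'
)

# DictReader consumes the header line itself, so no data[1:] step is needed.
data = [tokenize_cast(row["star_cast"]) for row in csv.DictReader(sample_csv)]
print(data)
```

With the real file, `open(filename, newline="")` inside a `with` block would take the place of `io.StringIO` and also close the file afterwards.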

In [11]:
data[0:250]

[['frank', 'darabont', 'dir', 'tim', 'robbins', 'morgan', 'freeman'],
 ['francis', 'ford', 'coppola', 'dir', 'marlon', 'brando', 'al', 'pacino'],
 ['francis', 'ford', 'coppola', 'dir', 'al', 'pacino', 'robert', 'de', 'niro'],
 ['christopher', 'nolan', 'dir', 'christian', 'bale', 'heath', 'ledger'],
 ['sidney', 'lumet', 'dir', 'henry', 'fonda', 'lee', 'j', 'cobb'],
 ['steven', 'spielberg', 'dir', 'liam', 'neeson', 'ralph', 'fiennes'],
 ['peter', 'jackson', 'dir', 'elijah', 'wood', 'viggo', 'mortensen'],
 ['quentin', 'tarantino', 'dir', 'john', 'travolta', 'uma', 'thurman'],
 ['sergio', 'leone', 'dir', 'clint', 'eastwood', 'eli', 'wallach'],
 ['peter', 'jackson', 'dir', 'elijah', 'wood', 'ian', 'mckellen'],
 ['david', 'fincher', 'dir', 'brad', 'pitt', 'edward', 'norton'],
 ['robert', 'zemeckis', 'dir', 'tom', 'hanks', 'robin', 'wright'],
 ['christopher',
  'nolan',
  'dir',
  'leonardo',
  'dicaprio',
  'joseph',
  'gordon-levitt'],
 ['peter', 'jackson', 'dir', 'elijah', 'wood', 'ian', '

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(dat)  # fits on the column labels; use dat['movie_title'] to fit on the titles

In [13]:
x.toarray()

array([[0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0.]])

In [14]:
!pip install texthero

Collecting texthero
  Downloading https://files.pythonhosted.org/packages/1f/5a/a9d33b799fe53011de79d140ad6d86c440a2da1ae8a7b24e851ee2f8bde8/texthero-1.0.9-py3-none-any.whl
Collecting nltk>=3.3
  Downloading https://files.pythonhosted.org/packages/92/75/ce35194d8e3022203cca0d2f896dbb88689f9b3fce8e9f9cff942913519d/nltk-3.5.zip (1.4MB)
     |████████████████████████████████| 1.4MB 7.3MB/s
Collecting unidecode>=1.1.1
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
     |████████████████████████████████| 245kB 38.4MB/s
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.5-cp36-none-any.whl size=1434678 sha256=b5e8d0cfd91da071f44c3e58b9b0a704ae1c745974840a9e85851462a7c176d2
  Stored in directory: /root/.cache/pip/wheels/ae/8c/3f/b1fe0ba04555b08b57ab52ab7f86023639a

In [15]:
import texthero as hero
dat['tfidf'] = hero.tfidf(dat['movie_title'])
dat['tfidf1'] = hero.tfidf(dat['star_cast'])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [16]:
dat.head(250)

Unnamed: 0,movie_title,year,star_cast,rating,link,place,tfidf,tfidf1
0,The Shawshank Redemption,1994,"Frank Darabont (dir.), Tim Robbins, Morgan Fre...",9.222725,/title/tt0111161/,1,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07569941646276615, 0.0, 0.0, 0.0, 0.0, 0.0,..."
1,The Godfather,1972,"Francis Ford Coppola (dir.), Marlon Brando, Al...",9.148871,/title/tt0068646/,2,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07311174033826791, 0.0, 0.0, 0.0, 0.0, 0.0,..."
2,The Godfather: Part II,1974,"Francis Ford Coppola (dir.), Al Pacino, Robert...",8.981205,/title/tt0071562/,3,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07453921922394281, 0.0, 0.0, 0.0, 0.0, 0.0,..."
3,The Dark Knight,2008,"Christopher Nolan (dir.), Christian Bale, Heat...",8.973397,/title/tt0468569/,4,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.08079076718313778, 0.0, 0.0, 0.0, 0.0, 0.0,..."
4,12 Angry Men,1957,"Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb",8.930464,/title/tt0050083/,5,"[0.0, 0.0, 0.562988475313629, 0.0, 0.0, 0.0, 0...","[0.07035351957540238, 0.0, 0.0, 0.0, 0.0, 0.0,..."
...,...,...,...,...,...,...,...,...
245,The Terminator,1984,"James Cameron (dir.), Arnold Schwarzenegger, L...",8.009042,/title/tt0088247/,246,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07883578785645051, 0.0, 0.0, 0.0, 0.0, 0.0,..."
246,Tangerines,2013,"Zaza Urushadze (dir.), Lembit Ulfsak, Elmo Nüg...",8.008171,/title/tt2991224/,247,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.06982689584705071, 0.0, 0.0, 0.0, 0.0, 0.0,..."
247,Aladdin,1992,"Ron Clements (dir.), Scott Weinger, Robin Will...",8.007445,/title/tt0103639/,248,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07636743776884315, 0.0, 0.0, 0.0, 0.0, 0.0,..."
248,A Silent Voice: The Movie,2016,"Naoko Yamada (dir.), Miyu Irino, Saori Hayami",8.007410,/title/tt5323662/,249,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.06982689584705071, 0.0, 0.0, 0.0, 0.0, 0.0,..."


In [51]:
tf1 = (dat['movie_title'][0:250]).apply(lambda x: pd.value_counts(x.split(" "))).sum(axis = 0).reset_index()
tf1.columns = ['words','tf']
tf1

Unnamed: 0,words,tf
0,Redemption,1.0
1,Shawshank,1.0
2,The,58.0
3,Godfather,1.0
4,Part,2.0
...,...,...
483,Movie,1.0
484,Silent,1.0
485,Munna,1.0
486,MBBS,1.0


In [53]:
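import numpy as np
# idf = ln(N / df), where df counts the titles that contain the word as a substring;
# regex=False makes the match literal, so characters such as "(" or "?" are safe
for i, word in enumerate(tf1['words']):
    tf1.loc[i, 'idf'] = np.log(dat.shape[0] / len(dat[dat['movie_title'].str.contains(word, regex=False)]))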

tf1

Unnamed: 0,words,tf,idf
0,Redemption,1.0,5.521461
1,Shawshank,1.0,5.521461
2,The,58.0,1.496109
3,Godfather,1.0,4.828314
4,Part,2.0,4.828314
...,...,...,...
483,Movie,1.0,5.521461
484,Silent,1.0,5.521461
485,Munna,1.0,5.521461
486,MBBS,1.0,5.521461
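
The `idf` column uses the unsmoothed definition ln(N / df), where N is the number of titles (250) and df is the number of titles containing the word. The short check below reproduces the printed values. The document frequencies are inferred from the output above rather than stated in the notebook, and the name `plain_idf` is illustrative.

```python
import math

N_TITLES = 250  # rows in the DataFrame, per dat.shape


def plain_idf(document_frequency, n_documents=N_TITLES):
    """Unsmoothed inverse document frequency: ln(N / df)."""
    return math.log(n_documents / document_frequency)


print(round(plain_idf(1), 6))  # a word found in a single title, e.g. "Shawshank"
print(round(plain_idf(2), 6))  # a word found in two titles, e.g. "Godfather"

# Inverting the formula recovers df from a printed idf: df = N / exp(idf).
print(round(N_TITLES / math.exp(1.496109)))  # document frequency implied for "The"
```

Because the notebook counts substring matches, df can be larger than the number of titles in which the word appears as a separate token.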


In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word',
 stop_words= 'english',ngram_range=(1,1))
train_vect = tfidf.fit_transform(dat['movie_title'])

train_vect

<250x408 sparse matrix of type '<class 'numpy.float64'>'
	with 459 stored elements in Compressed Sparse Row format>

In [55]:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1,1),analyzer = "word")
train_bow = bow.fit_transform(dat['movie_title'])
train_bow

<250x463 sparse matrix of type '<class 'numpy.int64'>'
	with 655 stored elements in Compressed Sparse Row format>

In [38]:
import nltk
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
nltk.download('punkt')
try1=[]
for i in dat['movie_title']:
  text = word_tokenize(i)
  try1.append(nltk.pos_tag(text))

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [39]:
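c_nn = 0  # singular common nouns (tag "NN")
c_jj = 0  # adjectives (tag "JJ")

for tagged_title in try1:
    for token, tag in tagged_title:
        if tag == "NN":
            c_nn += 1
        if tag == "JJ":
            c_jj += 1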

In [40]:
c_nn

98

In [37]:
import spacy
nlp=spacy.load("en_core_web_sm")
for i in dat['movie_title']:
  for token in nlp(i):
    print(token.text,"=>",token.pos_,"=>",token.tag_)

The => DET => DT
Shawshank => PROPN => NNP
Redemption => PROPN => NNP
The => DET => DT
Godfather => PROPN => NNP
The => DET => DT
Godfather => PROPN => NNP
: => PUNCT => :
Part => NOUN => NN
II => PROPN => NNP
The => DET => DT
Dark => PROPN => NNP
Knight => PROPN => NNP
12 => NUM => CD
Angry => PROPN => NNP
Men => PROPN => NNPS
Schindler => PROPN => NNP
's => PART => POS
List => NOUN => NN
The => DET => DT
Lord => PROPN => NNP
of => ADP => IN
the => DET => DT
Rings => PROPN => NNPS
: => PUNCT => :
The => DET => DT
Return => NOUN => NN
of => ADP => IN
the => DET => DT
King => PROPN => NNP
Pulp => NOUN => NN
Fiction => NOUN => NN
The => DET => DT
Good => PROPN => NNP
, => PUNCT => ,
the => DET => DT
Bad => PROPN => NNP
and => CCONJ => CC
the => DET => DT
Ugly => PROPN => NNP
  => SPACE => _SP
The => DET => DT
Lord => PROPN => NNP
of => ADP => IN
the => DET => DT
Rings => PROPN => NNPS
: => PUNCT => :
The => DET => DT
Fellowship => PROPN => NNP
of => ADP => IN
the => DET => DT
Ring => NOU

In [41]:
import spacy
nlp=spacy.load("en_core_web_sm")
for i in dat['star_cast']:
  for token in nlp(i):
    print(token.text,"=>",token.pos_,"=>",token.tag_)

Frank => PROPN => NNP
Darabont => PROPN => NNP
( => PUNCT => -LRB-
dir => PROPN => NNP
. => PUNCT => .
) => PUNCT => -RRB-
, => PUNCT => ,
Tim => PROPN => NNP
Robbins => PROPN => NNP
, => PUNCT => ,
Morgan => PROPN => NNP
Freeman => PROPN => NNP
Francis => PROPN => NNP
Ford => PROPN => NNP
Coppola => PROPN => NNP
( => PUNCT => -LRB-
dir => NOUN => NNS
. => PUNCT => .
) => PUNCT => -RRB-
, => PUNCT => ,
Marlon => PROPN => NNP
Brando => PROPN => NNP
, => PUNCT => ,
Al => PROPN => NNP
Pacino => PROPN => NNP
Francis => PROPN => NNP
Ford => PROPN => NNP
Coppola => PROPN => NNP
( => PUNCT => -LRB-
dir => NOUN => NNS
. => PUNCT => .
) => PUNCT => -RRB-
, => PUNCT => ,
Al => PROPN => NNP
Pacino => PROPN => NNP
, => PUNCT => ,
Robert => PROPN => NNP
De => PROPN => NNP
Niro => PROPN => NNP
Christopher => PROPN => NNP
Nolan => PROPN => NNP
( => PUNCT => -LRB-
dir => PROPN => NNP
. => PUNCT => .
) => PUNCT => -RRB-
, => PUNCT => ,
Christian => PROPN => NNP
Bale => PROPN => NNP
, => PUNCT => ,
Heat

In [62]:


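def number_of_cast(row):
    # Length of the star_cast string in characters (not the number of people)
    return len(row["star_cast"])

def number_of_links(row):
    # Length of the link string in characters
    return len(row["link"])

def number_of_movie(row):
    # Length of the movie title in characters
    return len(row["movie_title"])

def clean_text(row):
    clean = row["movie_title"]

    # Remove the symbols '#', '/', '(' and ')'
    clean = clean.replace("#", "").replace("/", "").replace("(", "").replace(")", "")

    return clean.strip()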



dat["number_of_cast"] = dat.apply(lambda row: number_of_cast(row), axis = 1)
dat["number_of_links"] = dat.apply(lambda row: number_of_links(row), axis = 1)
dat["number_of_movie"] = dat.apply(lambda row: number_of_movie(row), axis = 1)

dat["clean_text"] = dat.apply(lambda row: clean_text(row), axis = 1)
dat.sample(250)

Unnamed: 0,movie_title,year,star_cast,rating,link,place,tfidf,tfidf1,number_of_cast,number_of_links,number_of_movie,clean_text
132,Unforgiven,1992,"Clint Eastwood (dir.), Clint Eastwood, Gene Ha...",8.180544,/title/tt0105695/,133,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.06989829494425324, 0.0, 0.0, 0.0, 0.0, 0.0,...",51,17,10,Unforgiven
155,Shutter Island,2010,"Martin Scorsese (dir.), Leonardo DiCaprio, Emi...",8.125708,/title/tt1130884/,156,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.08147304262259082, 0.0, 0.0, 0.0, 0.0, 0.0,...",57,17,14,Shutter Island
247,Aladdin,1992,"Ron Clements (dir.), Scott Weinger, Robin Will...",8.007445,/title/tt0103639/,248,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07636743776884315, 0.0, 0.0, 0.0, 0.0, 0.0,...",50,17,7,Aladdin
67,Princess Mononoke,1997,"Hayao Miyazaki (dir.), Yôji Matsuda, Yuriko Is...",8.351394,/title/tt0119698/,68,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0741496109751322, 0.0, 0.0, 0.0, 0.0, 0.0, ...",50,17,17,Princess Mononoke
35,The Pianist,2002,"Roman Polanski (dir.), Adrien Brody, Thomas Kr...",8.486517,/title/tt0253474/,36,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07330074836544564, 0.0, 0.0, 0.0, 0.0, 0.0,...",55,17,11,The Pianist
...,...,...,...,...,...,...,...,...,...,...,...,...
234,Portrait of a Lady on Fire,2019,"Céline Sciamma (dir.), Noémie Merlant, Adèle H...",8.020909,/title/tt8613070/,235,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.06982689584705071, 0.0, 0.0, 0.0, 0.0, 0.0,...",51,17,26,Portrait of a Lady on Fire
127,Batman Begins,2005,"Christopher Nolan (dir.), Christian Bale, Mich...",8.190215,/title/tt0372784/,128,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.08542856665758602, 0.0, 0.0, 0.0, 0.0, 0.0,...",55,17,13,Batman Begins
204,Mad Max: Fury Road,2015,"George Miller (dir.), Tom Hardy, Charlize Theron",8.061643,/title/tt1392190/,205,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07729695979214617, 0.0, 0.0, 0.0, 0.0, 0.0,...",48,17,18,Mad Max: Fury Road
125,Die Hard,1988,"John McTiernan (dir.), Bruce Willis, Alan Rickman",8.194367,/title/tt0095016/,126,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07515342803841056, 0.0, 0.0, 0.0, 0.0, 0.0,...",49,17,8,Die Hard
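
`len(row["star_cast"])` measures the length of the string in characters, which is why `number_of_cast` is around 50 for every row. If the intended feature is the number of people credited, one possible sketch is to split the field on commas. The helper name `count_people` is hypothetical.

```python
def count_people(star_cast):
    """Count the comma-separated names in a star_cast string."""
    return len([name for name in star_cast.split(",") if name.strip()])


cast = "Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb"
print(len(cast))           # characters in the string
print(count_people(cast))  # names in the string
```

The same distinction applies to `number_of_links` and `number_of_movie`, which are also character counts.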


In [64]:
from nltk.tokenize import word_tokenize

def get_tokens(row):
    return word_tokenize(row["clean_text"].lower())

dat["tokens"] = dat.apply(lambda row: get_tokens(row), axis = 1)
dat.sample(5, random_state = 4)

Unnamed: 0,movie_title,year,star_cast,rating,link,place,tfidf,tfidf1,number_of_cast,number_of_links,number_of_movie,clean_text,tokens
33,The Lion King,1994,"Roger Allers (dir.), Matthew Broderick, Jeremy...",8.49198,/title/tt0110357/,34,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07115862564277592, 0.0, 0.0, 0.0, 0.0, 0.0,...",52,17,13,The Lion King,"[the, lion, king]"
213,Into the Wild,2007,"Sean Penn (dir.), Emile Hirsch, Vince Vaughn",8.051746,/title/tt0758758/,214,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07115862564277592, 0.0, 0.0, 0.0, 0.0, 0.0,...",44,17,13,Into the Wild,"[into, the, wild]"
39,Hamilton,2020,"Thomas Kail (dir.), Lin-Manuel Miranda, Philli...",8.48133,/title/tt8503618/,40,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07156633117467874, 0.0, 0.0, 0.0, 0.0, 0.0,...",52,17,8,Hamilton,[hamilton]
6,The Lord of the Rings: The Return of the King,2003,"Peter Jackson (dir.), Elijah Wood, Viggo Morte...",8.883515,/title/tt0167260/,7,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0786811880970543, 0.0, 0.0, 0.0, 0.0, 0.0, ...",50,17,45,The Lord of the Rings: The Return of the King,"[the, lord, of, the, rings, :, the, return, of..."
101,A Clockwork Orange,1971,"Stanley Kubrick (dir.), Malcolm McDowell, Patr...",8.245797,/title/tt0066921/,102,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07667449172309425, 0.0, 0.0, 0.0, 0.0, 0.0,...",55,17,18,A Clockwork Orange,"[a, clockwork, orange]"


In [65]:
s = ["of", "in", "the", "A"]

def get_postags(row):
    
    postags = nltk.pos_tag(row["tokens"])
    list_classes = list()
    for  word in postags:
        list_classes.append(word[1])
    
    return list_classes

dat["postags"] = dat.apply(lambda row: get_postags(row), axis = 1)
dat.sample(5, random_state = 4)

Unnamed: 0,movie_title,year,star_cast,rating,link,place,tfidf,tfidf1,number_of_cast,number_of_links,number_of_movie,clean_text,tokens,postags
33,The Lion King,1994,"Roger Allers (dir.), Matthew Broderick, Jeremy...",8.49198,/title/tt0110357/,34,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07115862564277592, 0.0, 0.0, 0.0, 0.0, 0.0,...",52,17,13,The Lion King,"[the, lion, king]","[DT, NN, NN]"
213,Into the Wild,2007,"Sean Penn (dir.), Emile Hirsch, Vince Vaughn",8.051746,/title/tt0758758/,214,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07115862564277592, 0.0, 0.0, 0.0, 0.0, 0.0,...",44,17,13,Into the Wild,"[into, the, wild]","[IN, DT, JJ]"
39,Hamilton,2020,"Thomas Kail (dir.), Lin-Manuel Miranda, Philli...",8.48133,/title/tt8503618/,40,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07156633117467874, 0.0, 0.0, 0.0, 0.0, 0.0,...",52,17,8,Hamilton,[hamilton],[NN]
6,The Lord of the Rings: The Return of the King,2003,"Peter Jackson (dir.), Elijah Wood, Viggo Morte...",8.883515,/title/tt0167260/,7,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0786811880970543, 0.0, 0.0, 0.0, 0.0, 0.0, ...",50,17,45,The Lord of the Rings: The Return of the King,"[the, lord, of, the, rings, :, the, return, of...","[DT, NN, IN, DT, NNS, :, DT, NN, IN, DT, NN]"
101,A Clockwork Orange,1971,"Stanley Kubrick (dir.), Malcolm McDowell, Patr...",8.245797,/title/tt0066921/,102,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07667449172309425, 0.0, 0.0, 0.0, 0.0, 0.0,...",55,17,18,A Clockwork Orange,"[a, clockwork, orange]","[DT, NN, NN]"


In [66]:
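from collections import Counter

def find_no_class(count, class_name = ""):
    # Sum the counts of all tags that start with class_name (e.g. "NN" also covers "NNS")
    total = 0
    for key in count.keys():
        if key.startswith(class_name):
            total += count[key]
    return total

def get_classes(row, grammatical_class = ""):
    # Share of the row's POS tags that belong to the given grammatical class
    count = Counter(row["postags"])
    return find_no_class(count, class_name = grammatical_class)/len(row["postags"])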

dat["freqAdverbs"] = dat.apply(lambda row: get_classes(row, "RB"), axis = 1)
dat["freqVerbs"] = dat.apply(lambda row: get_classes(row, "VB"), axis = 1)
dat["freqAdjectives"] = dat.apply(lambda row: get_classes(row, "JJ"), axis = 1)
dat["freqNouns"] = dat.apply(lambda row: get_classes(row, "NN"), axis = 1)

dat.sample(250)

Unnamed: 0,movie_title,year,star_cast,rating,link,place,tfidf,tfidf1,number_of_cast,number_of_links,number_of_movie,clean_text,tokens,postags,freqAdverbs,freqVerbs,freqAdjectives,freqNouns
241,The Circus,1928,"Charles Chaplin (dir.), Charles Chaplin, Merna...",8.012254,/title/tt0018773/,242,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07425866520463416, 0.0, 0.0, 0.0, 0.0, 0.0,...",54,17,10,The Circus,"[the, circus]","[DT, NN]",0.000000,0.000000,0.0,0.500000
74,Your Name,2016,"Makoto Shinkai (dir.), Ryûnosuke Kamiki, Mone ...",8.320145,/title/tt5311514/,75,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.06982689584705071, 0.0, 0.0, 0.0, 0.0, 0.0,...",59,17,9,Your Name,"[your, name]","[PRP$, NN]",0.000000,0.000000,0.0,0.500000
196,Prisoners,2013,"Denis Villeneuve (dir.), Hugh Jackman, Jake Gy...",8.073125,/title/tt1392214/,197,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07469571531568694, 0.0, 0.0, 0.0, 0.0, 0.0,...",54,17,9,Prisoners,[prisoners],[NNS],0.000000,0.000000,0.0,1.000000
211,Ben-Hur,1959,"William Wyler (dir.), Charlton Heston, Jack Ha...",8.052853,/title/tt0052618/,212,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07444400100082015, 0.0, 0.0, 0.0, 0.0, 0.0,...",51,17,7,Ben-Hur,[ben-hur],[NN],0.000000,0.000000,0.0,1.000000
42,City Lights,1931,"Charles Chaplin (dir.), Charles Chaplin, Virgi...",8.472443,/title/tt0021749/,43,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07334125455175967, 0.0, 0.0, 0.0, 0.0, 0.0,...",58,17,11,City Lights,"[city, lights]","[NN, NNS]",0.000000,0.000000,0.0,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
146,My Neighbor Totoro,1988,"Hayao Miyazaki (dir.), Hitoshi Takagi, Noriko ...",8.146608,/title/tt0096283/,147,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0741496109751322, 0.0, 0.0, 0.0, 0.0, 0.0, ...",52,17,18,My Neighbor Totoro,"[my, neighbor, totoro]","[PRP$, NN, NN]",0.000000,0.000000,0.0,0.666667
157,V for Vendetta,2005,"James McTeigue (dir.), Hugo Weaving, Natalie P...",8.124392,/title/tt0434409/,158,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07317332463216965, 0.0, 0.0, 0.0, 0.0, 0.0,...",52,17,14,V for Vendetta,"[v, for, vendetta]","[NN, IN, NN]",0.000000,0.000000,0.0,0.666667
14,Star Wars: Episode V - The Empire Strikes Back,1980,"Irvin Kershner (dir.), Mark Hamill, Harrison Ford",8.698538,/title/tt0080684/,15,"[0.3325448493982908, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.07867575647464492, 0.0, 0.0, 0.0, 0.0, 0.0,...",49,17,46,Star Wars: Episode V - The Empire Strikes Back,"[star, wars, :, episode, v, -, the, empire, st...","[NN, NNS, :, NN, SYM, :, DT, NN, VBZ, RB]",0.100000,0.100000,0.0,0.400000
242,The Help,2011,"Tate Taylor (dir.), Emma Stone, Viola Davis",8.011752,/title/tt1454029/,243,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0706172511978718, 0.0, 0.0, 0.0, 0.0, 0.0, ...",43,17,8,The Help,"[the, help]","[DT, NN]",0.000000,0.000000,0.0,0.500000
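
The four frequency columns give the share of tokens whose tag starts with a given Penn Treebank prefix. Below is a self-contained sketch of the same calculation with a guard for empty tag lists. The guard is an added assumption, since every title in this data set has at least one token, and the function name `tag_share` is illustrative.

```python
from collections import Counter


def tag_share(postags, prefix):
    """Fraction of tags in `postags` that start with `prefix`."""
    if not postags:
        return 0.0  # avoid ZeroDivisionError on an empty title
    counts = Counter(postags)
    matching = sum(n for tag, n in counts.items() if tag.startswith(prefix))
    return matching / len(postags)


# Tags shown above for "Star Wars: Episode V - The Empire Strikes Back"
tags = ["NN", "NNS", ":", "NN", "SYM", ":", "DT", "NN", "VBZ", "RB"]
for name, prefix in [("freqAdverbs", "RB"), ("freqVerbs", "VB"),
                     ("freqAdjectives", "JJ"), ("freqNouns", "NN")]:
    print(name, tag_share(tags, prefix))
```

Punctuation tags such as `:` count towards the denominator here, as they do in the notebook's version.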
