## The fifth In-class-exercise (9/30/2020, 20 points in total)

In exercise-03, I asked you to collect 500 textual data items based on your own information needs (if you didn't collect the textual data, you should recollect it for this exercise). Now we need to think about how to represent the textual data for text classification. In this exercise, you are required to select 10 types of features (10 feature types, which will amount to well over 10 individual features) from the following feature list, then represent the 500 texts with these features. The output should be in the following format:
![image.png](attachment:image.png)

The feature list:

* (1) tf-idf features
* (2) POS-tag features: number of adjectives, adverbs, auxiliaries, punctuation marks, complementizers, coordinating conjunctions, subordinating conjunctions, determiners, interjections, nouns, possessors, prepositions, pronouns, quantifiers, verbs, and others. (Select some of them if you use POS-tag features.)
* (3) Linguistic features:
  * number of right-branching nodes across all constituent types
  * number of right-branching nodes for NPs only
  * number of left-branching nodes across all constituent types
  * number of left-branching nodes for NPs only
  * number of premodifiers across all constituent types
  * number of premodifiers within NPs only
  * number of postmodifiers across all constituent types
  * number of postmodifiers within NPs only
  * branching index across all constituent types, i.e. the number of right-branching nodes minus number of left-branching nodes
  * branching index for NPs only
  * branching weight index: number of tokens covered by right-branching nodes minus number of tokens covered by left-branching nodes across all categories
  * branching weight index for NPs only 
  * modification index, i.e. the number of premodifiers minus the number of postmodifiers across all categories
  * modification index for NPs only
  * modification weight index: length in tokens of all premodifiers minus length in tokens of all postmodifiers across all categories
  * modification weight index for NPs only
  * coordination balance, i.e. the maximal length difference in coordinated constituents
  
  * density (density can be calculated as the ratio of the following function words to content words) of determiners/quantifiers
  * density of pronouns
  * density of prepositions
  * density of punctuation marks, specifically commas and semicolons
  * density of auxiliary verbs
  * density of conjunctions
  * density of different pronoun types: Wh, 1st, 2nd, and 3rd person pronouns
  
  * maximal and average NP length
  * maximal and average AJP length
  * maximal and average PP length
  * maximal and average AVP length
  * sentence length

* Other features you have in mind (e.g., pre-defined patterns)
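
The density features in the list are described as a ratio of function words to content words. A minimal sketch of that ratio follows, assuming Penn Treebank tags such as those returned by `nltk.pos_tag`. The `CONTENT_PREFIXES` grouping and the `density` helper are illustrative names and choices, not part of the exercise.

```python
# Hypothetical helper: density of one function-word class relative to content words.
# Assumption: content words are nouns, verbs, adjectives and adverbs,
# identified by Penn Treebank tag prefixes.
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")

def density(tagged_tokens, function_tags):
    """Return (# tokens tagged with one of function_tags) / (# content-word tokens)."""
    content = sum(1 for _, tag in tagged_tokens if tag.startswith(CONTENT_PREFIXES))
    function = sum(1 for _, tag in tagged_tokens if tag in function_tags)
    return function / content if content else 0.0

# Hand-tagged example sentence, for illustration only.
tagged = [("this", "DT"), ("tv", "NN"), ("is", "VBZ"), ("really", "RB"),
          ("good", "JJ"), ("for", "IN"), ("the", "DT"), ("price", "NN")]

determiner_density = density(tagged, {"DT", "PDT"})   # 2 determiners / 5 content words
preposition_density = density(tagged, {"IN"})         # 1 preposition / 5 content words
```

Returning `0.0` when a text has no content words is a design choice that keeps the feature column numeric.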

# **Extracting Data**

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
main_text = []  # List to store review headings
sub_text = []  # List to store full reviews
for number in range(52):
  link = "https://www.flipkart.com/nokia-139cm-55-inch-ultra-hd-4k-led-smart-android-tv-sound-jbl/product-reviews/itmffvfvyztsmfmq?pid=TVSFFVFVJEGZ3R5H&lid=LSTTVSFFVFVJEGZ3R5HW9DJ5W&marketplace=FLIPKART&page=" + str(number) # Generating link dynamically
  page = requests.get(link) # Accessing the webpage
  soup = BeautifulSoup(page.text, 'html.parser')
  main_reviews = soup.find_all(class_='_2xg6Ul') # Getting the Review Heading by using the class name
  text_reviews = soup.find_all(class_='qwjRop') # Getting the full reviews by using the class name
  for ele, sub_ele in zip(main_reviews, text_reviews) : # Iterating through the list
      main_text.append(ele.text) #Appending to empty list
      sub_text.append(sub_ele.text)
df = pd.DataFrame(list(zip(main_text, sub_text)), columns =['Glimpse of Review', 'Full Review'])  # Creating Dataframe
print("Length of data frame is {0}".format(len(df)))
df

Length of data frame is 510


Unnamed: 0,Glimpse of Review,Full Review
0,Review from Technology Gyan: Almost everything...,"This 55"" 4K Nokia TV at this price point comes..."
1,Terrific purchase,"This might seem to be awkward, but this is the..."
2,Brilliant,I must say it is best decision to by Nokia TV....
3,Best in the market!,Flipcart delivered the Product in less than 24...
4,Pretty good,"Pros1) Picture Quality is good, micro dimming ..."
...,...,...
505,Best in the market!,Best Quality for nokiaREAD MORE
506,Horrible,"Although all things good about this tv, one th..."
507,Wonderful,Very nice tv & sound is bestREAD MORE
508,Nice,Some bleeding on the screen. Sound is okay. Pi...
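
Rows 505 and 507 above end with the literal text `READ MORE`, which comes from the page's expand widget rather than from the reviewer. A minimal sketch of stripping it before preprocessing is shown below, assuming the marker only ever appears at the very end of a review; `strip_read_more` is a hypothetical helper name.

```python
# Sketch: remove the trailing "READ MORE" widget text from a scraped review.
# Assumption: the marker, when present, is the last thing in the string.
MARKER = "READ MORE"

def strip_read_more(review: str) -> str:
    review = review.rstrip()
    if review.endswith(MARKER):
        review = review[: -len(MARKER)].rstrip()
    return review

examples = [
    "Best Quality for nokiaREAD MORE",
    "Very nice tv & sound is bestREAD MORE",
    "Pretty good",
]
cleaned = [strip_read_more(text) for text in examples]
```

On the dataframe this could be applied with `df['Full Review'].apply(strip_read_more)` before the lower-casing step.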


# **Preprocessing Data**

**Converting to lower case**

In [None]:
df['After Preprocessing'] = df['Full Review'].apply(lambda x: " ".join(word.lower() for word in x.split()))
df

Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing
0,Review from Technology Gyan: Almost everything...,"This 55"" 4K Nokia TV at this price point comes...","this 55"" 4k nokia tv at this price point comes..."
1,Terrific purchase,"This might seem to be awkward, but this is the...","this might seem to be awkward, but this is the..."
2,Brilliant,I must say it is best decision to by Nokia TV....,i must say it is best decision to by nokia tv....
3,Best in the market!,Flipcart delivered the Product in less than 24...,flipcart delivered the product in less than 24...
4,Pretty good,"Pros1) Picture Quality is good, micro dimming ...","pros1) picture quality is good, micro dimming ..."
...,...,...,...
505,Best in the market!,Best Quality for nokiaREAD MORE,best quality for nokiaread more
506,Horrible,"Although all things good about this tv, one th...","although all things good about this tv, one th..."
507,Wonderful,Very nice tv & sound is bestREAD MORE,very nice tv & sound is bestread more
508,Nice,Some bleeding on the screen. Sound is okay. Pi...,some bleeding on the screen. sound is okay. pi...


**Removing Punctuation**

In [None]:
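# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
df['After Preprocessing'] = df['After Preprocessing'].str.replace(r'[^\w\s]', '', regex=True)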
df

Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing
0,Review from Technology Gyan: Almost everything...,"This 55"" 4K Nokia TV at this price point comes...",this 55 4k nokia tv at this price point comes ...
1,Terrific purchase,"This might seem to be awkward, but this is the...",this might seem to be awkward but this is the ...
2,Brilliant,I must say it is best decision to by Nokia TV....,i must say it is best decision to by nokia tv ...
3,Best in the market!,Flipcart delivered the Product in less than 24...,flipcart delivered the product in less than 24...
4,Pretty good,"Pros1) Picture Quality is good, micro dimming ...",pros1 picture quality is good micro dimming wo...
...,...,...,...
505,Best in the market!,Best Quality for nokiaREAD MORE,best quality for nokiaread more
506,Horrible,"Although all things good about this tv, one th...",although all things good about this tv one thi...
507,Wonderful,Very nice tv & sound is bestREAD MORE,very nice tv sound is bestread more
508,Nice,Some bleeding on the screen. Sound is okay. Pi...,some bleeding on the screen sound is okay pict...


**Removing Numerics**

In [None]:
df['After Preprocessing'] = df['After Preprocessing'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))

**Removing Special Characters**

In [None]:
import re
# Replace every character that is not a letter or digit with a space
df['After Preprocessing'] = df['After Preprocessing'].apply(lambda x: re.sub(r"[^a-zA-Z0-9]", " ", x))
df

Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing
0,Review from Technology Gyan: Almost everything...,"This 55"" 4K Nokia TV at this price point comes...",this k nokia tv at this price point comes wit...
1,Terrific purchase,"This might seem to be awkward, but this is the...",this might seem to be awkward but this is the ...
2,Brilliant,I must say it is best decision to by Nokia TV....,i must say it is best decision to by nokia tv ...
3,Best in the market!,Flipcart delivered the Product in less than 24...,flipcart delivered the product in less than h...
4,Pretty good,"Pros1) Picture Quality is good, micro dimming ...",pros picture quality is good micro dimming wor...
...,...,...,...
505,Best in the market!,Best Quality for nokiaREAD MORE,best quality for nokiaread more
506,Horrible,"Although all things good about this tv, one th...",although all things good about this tv one thi...
507,Wonderful,Very nice tv & sound is bestREAD MORE,very nice tv sound is bestread more
508,Nice,Some bleeding on the screen. Sound is okay. Pi...,some bleeding on the screen sound is okay pict...
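
The lower-casing, punctuation, digit and special-character steps above can also be folded into one function. The sketch below is one way to do that with the standard library only; `clean_text` is a hypothetical name. Note one deliberate difference: it replaces unwanted characters with a space instead of deleting them, so `don't` becomes `don t` rather than `dont`.

```python
import re

def clean_text(text: str) -> str:
    """Lower-case, drop digits and punctuation, and collapse whitespace."""
    text = text.lower()
    # Anything that is not a-z or whitespace becomes a space
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace left behind by the substitution
    return " ".join(text.split())

sample = 'This 55" 4K Nokia TV at this price point comes with JBL sound!'
cleaned_sample = clean_text(sample)
```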


In [None]:
import nltk
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> all
    Downloading collection 'all'
       | 
       | Downloading package abc to /root/nltk_data...
       |   Package abc is already up-to-date!
       | Downloading package alpino to /root/nltk_data...
       |   Package alpino is already up-to-date!
       | Downloading package biocreative_ppi to /root/nltk_data...
       |   Package biocreative_ppi is already up-to-date!
       | Downloading package brown to /root/nltk_data...
       |   Package brown is already up-to-date!
       | Downloading package brown_tei to /root/nltk_data...
       |   Package brown_tei is already up-to-date!
       | Downloading package cess_cat to /root/nltk_data...
       |   Package cess_cat is 

True

**Removing Stop Words**

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
df['After Preprocessing'] = df['After Preprocessing'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df

Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing
0,Review from Technology Gyan: Almost everything...,"This 55"" 4K Nokia TV at this price point comes...",k nokia tv price point comes almost everything...
1,Terrific purchase,"This might seem to be awkward, but this is the...",might seem awkward first led tv home flat tv n...
2,Brilliant,I must say it is best decision to by Nokia TV....,must say best decision nokia tv device feature...
3,Best in the market!,Flipcart delivered the Product in less than 24...,flipcart delivered product less hrs really ama...
4,Pretty good,"Pros1) Picture Quality is good, micro dimming ...",pros picture quality good micro dimming works ...
...,...,...,...
505,Best in the market!,Best Quality for nokiaREAD MORE,best quality nokiaread
506,Horrible,"Although all things good about this tv, one th...",although things good tv one thing found displa...
507,Wonderful,Very nice tv & sound is bestREAD MORE,nice tv sound bestread
508,Nice,Some bleeding on the screen. Sound is okay. Pi...,bleeding screen sound okay picture quality goo...
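
`stopwords.words('english')` returns a list, so each `x not in stop` check scans the list. Converting it to a set is a common tweak when the corpus grows. The sketch below shows the idea with a tiny hand-written stop-word set standing in for the NLTK list (an assumption, so that the example runs without downloads).

```python
# Assumption: a small hand-written set stands in for stopwords.words('english').
STOP_WORDS = frozenset({"this", "is", "the", "to", "be", "but", "at", "it", "i", "a"})

def remove_stop_words(text: str, stop_words=STOP_WORDS) -> str:
    """Keep only the whitespace-separated tokens that are not stop words."""
    return " ".join(word for word in text.split() if word not in stop_words)

filtered = remove_stop_words("this might seem to be awkward but this is the first led tv")
```

With NLTK available, `set(stopwords.words('english'))` could be passed as `stop_words`.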


**Spelling Correction**

In [None]:
from textblob import TextBlob
df['After Preprocessing'].apply(lambda x: str(TextBlob(x).correct()))

0      k nikita to price point comes almost everythin...
1      might seem awkward first led to home flat to n...
2      must say best decision nikita to device featur...
3      flipcart delivered product less his really ama...
4      pro picture quality good micro dining works we...
                             ...                        
505                               best quality nokiaread
506    although things good to one thing found displa...
507                               nice to sound bedstead
508    bleeding screen sound okay picture quality goo...
509    got today now five stars really amazing produc...
Name: After Preprocessing, Length: 510, dtype: object

**Stemming**

In [None]:
from nltk.stem import PorterStemmer
st = PorterStemmer()
df['After Preprocessing'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

0      k nokia tv price point come almost everyth exp...
1      might seem awkward first led tv home flat tv n...
2      must say best decis nokia tv devic featur load...
3      flipcart deliv product le hr realli amaz quick...
4      pro pictur qualiti good micro dim work well jb...
                             ...                        
505                               best qualiti nokiaread
506    although thing good tv one thing found display...
507                               nice tv sound bestread
508    bleed screen sound okay pictur qualiti good go...
509    got today wow five star realli amaz product qu...
Name: After Preprocessing, Length: 510, dtype: object

**Lemmatization**

In [None]:
from textblob import Word
import nltk
nltk.download('wordnet')

df['After Preprocessing'] = df['After Preprocessing'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing
0,Review from Technology Gyan: Almost everything...,"This 55"" 4K Nokia TV at this price point comes...",k nokia tv price point come almost everything ...
1,Terrific purchase,"This might seem to be awkward, but this is the...",might seem awkward first led tv home flat tv n...
2,Brilliant,I must say it is best decision to by Nokia TV....,must say best decision nokia tv device feature...
3,Best in the market!,Flipcart delivered the Product in less than 24...,flipcart delivered product le hr really amazed...
4,Pretty good,"Pros1) Picture Quality is good, micro dimming ...",pro picture quality good micro dimming work we...
...,...,...,...
505,Best in the market!,Best Quality for nokiaREAD MORE,best quality nokiaread
506,Horrible,"Although all things good about this tv, one th...",although thing good tv one thing found display...
507,Wonderful,Very nice tv & sound is bestREAD MORE,nice tv sound bestread
508,Nice,Some bleeding on the screen. Sound is okay. Pi...,bleeding screen sound okay picture quality goo...
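
Feature (1) in the list, tf-idf, can be computed from cleaned text like the `After Preprocessing` column. The following is a minimal standard-library sketch, assuming relative term frequency and `idf = log(N / df)`. This is only one common variant: libraries such as scikit-learn apply smoothing and normalisation by default, so their numbers will differ. The `tf_idf` function name is illustrative.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Short illustrative documents in the style of the cleaned reviews
docs = [
    "best quality nokiaread",
    "nice tv sound bestread",
    "bleeding screen sound okay picture quality",
]
features = tf_idf(docs)
```

Terms that occur in every document get weight 0 under this variant, which is why rare words such as `nokiaread` score higher than shared ones such as `quality`.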


# **Parts of Speech Tagging and Features**

In [None]:
from nltk.tokenize import word_tokenize
pos = []
for sentence in df['After Preprocessing']:
  text = word_tokenize(sentence)
  pos.append(nltk.pos_tag(text))
pos

[[('k', 'NN'),
  ('nokia', 'JJ'),
  ('tv', 'NN'),
  ('price', 'NN'),
  ('point', 'NN'),
  ('come', 'VB'),
  ('almost', 'RB'),
  ('everything', 'NN'),
  ('expect', 'JJ'),
  ('tv', 'NN'),
  ('today', 'NN'),
  ('time', 'NN'),
  ('pro', 'JJ'),
  ('con', 'NN'),
  ('according', 'VBG'),
  ('experiencedpros', 'JJ'),
  ('great', 'JJ'),
  ('quality', 'NN'),
  ('display', 'NN'),
  ('thin', 'JJ'),
  ('bezel', 'NN'),
  ('presence', 'NN'),
  ('almost', 'RB'),
  ('thin', 'JJ'),
  ('bezel', 'NNS'),
  ('change', 'VBP'),
  ('whole', 'JJ'),
  ('experience', 'NN'),
  ('viewing', 'VBG'),
  ('played', 'VBN'),
  ('high', 'JJ'),
  ('resolution', 'NN'),
  ('wildlife', 'NN'),
  ('video', 'NN'),
  ('youtube', 'NN'),
  ('amazed', 'VBD'),
  ('see', 'NN'),
  ('detail', 'NN'),
  ('colour', 'VBP'),
  ('sound', 'NN'),
  ('jbl', 'NN'),
  ('front', 'NN'),
  ('firing', 'VBG'),
  ('speaker', 'NN'),
  ('could', 'MD'),
  ('experience', 'VB'),
  ('clear', 'JJ'),
  ('vocal', 'JJ'),
  ('tone', 'NN'),
  ('dialogue', 'NN'),
  ('

In [None]:
Adjective = []
Adverb = []
CordinatingConjunction = []
SubordinatingConjuction = []
Interjection = []
Noun = []
Verb = []
PersonalPronoun = []
predeterminer = []
Determiner = []

In [None]:
for value in pos:
  AdjectiveCount = 0
  AdverbCount = 0
  CordinatingConjunctionCount = 0
  SubordinatingConjuctionCount = 0
  InterjectionCount = 0
  NounCount = 0
  VerbCount = 0
  PersonalPronounCount = 0
  predeterminerCount = 0
  DeterminerCount = 0
  for word,tag in value:
    if tag == 'JJ':
      AdjectiveCount = AdjectiveCount + 1
    elif tag == 'RB':
      AdverbCount = AdverbCount + 1
    elif tag == 'CC':
      CordinatingConjunctionCount = CordinatingConjunctionCount + 1
    elif tag == 'UH':
      InterjectionCount = InterjectionCount + 1
    elif tag == 'NN':
      NounCount = NounCount + 1
    elif tag.startswith('VB'):  # Penn Treebank verb tags: VB, VBD, VBG, VBN, VBP, VBZ
      VerbCount = VerbCount + 1
    elif tag == 'PRP':
      PersonalPronounCount = PersonalPronounCount + 1
    elif tag == 'PDT':
      predeterminerCount = predeterminerCount + 1
    elif tag == 'DT':
      DeterminerCount = DeterminerCount + 1
    elif tag == 'IN':  # Penn Treebank 'IN' covers prepositions as well as subordinating conjunctions
      SubordinatingConjuctionCount = SubordinatingConjuctionCount + 1
  Adjective.append(AdjectiveCount)
  Adverb.append(AdverbCount)
  CordinatingConjunction.append(CordinatingConjunctionCount)
  Interjection.append(InterjectionCount)
  Noun.append(NounCount)
  Verb.append(VerbCount)
  PersonalPronoun.append(PersonalPronounCount)
  predeterminer.append(predeterminerCount)
  Determiner.append(DeterminerCount)
  SubordinatingConjuction.append(SubordinatingConjuctionCount)
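
The loop above compares against exact tags such as `'JJ'` and `'NN'`, so inflected variants like `'JJR'` or `'NNS'` are not counted. One alternative, sketched here under the assumption that coarse groups by Penn Treebank tag prefix are what is wanted, keeps the groups in a dictionary. `TAG_GROUPS` and `count_pos_groups` are illustrative names.

```python
# Assumption: coarse groups keyed by Penn Treebank tag prefixes.
TAG_GROUPS = {
    "adjective": ("JJ",),   # JJ, JJR, JJS
    "adverb": ("RB",),      # RB, RBR, RBS
    "noun": ("NN",),        # NN, NNS, NNP, NNPS
    "verb": ("VB",),        # VB, VBD, VBG, VBN, VBP, VBZ
    "modal": ("MD",),
}

def count_pos_groups(tagged_tokens):
    """Count tokens per group; tags that match no group are ignored."""
    counts = dict.fromkeys(TAG_GROUPS, 0)
    for _, tag in tagged_tokens:
        for group, prefixes in TAG_GROUPS.items():
            if tag.startswith(prefixes):
                counts[group] += 1
                break
    return counts

# A few (word, tag) pairs taken from the tagger output shown earlier
tagged = [("k", "NN"), ("nokia", "JJ"), ("tv", "NN"), ("come", "VB"),
          ("almost", "RB"), ("bezel", "NNS"), ("change", "VBP"), ("could", "MD")]
pos_counts = count_pos_groups(tagged)
```

Applying `count_pos_groups` to each entry of `pos` would give one dictionary per review, which `pd.DataFrame` can turn into feature columns.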

In [None]:
df['Number of Adjectives'] = Adjective
df['Number of Adverbs'] = Adverb
df['Number of Cordinating Conjunctions'] = CordinatingConjunction
df['Number of Interjections'] = Interjection
df['Number of Nouns'] = Noun
df['Number of Verbs'] = Verb
df['Number of Personal Pronouns'] = PersonalPronoun
df['Number of Predeterminers'] = predeterminer
df['Number of Determiners'] = Determiner
df['Number of Subordinating Conjuctions'] = SubordinatingConjuction
df

Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing,Number of Adjectives,Number of Adverbs,Number of Cordinating Conjunctions,Number of Interjections,Number of Nouns,Number of Verbs,Number of Personal Pronouns,Number of Predeterminers,Number of Determiners,Number of Subordinating Conjuctions,Number of Right Branching Nodes,Sentenece Length
0,Review from Technology Gyan: Almost everything...,"This 55"" 4K Nokia TV at this price point comes...",k nokia tv price point come almost everything ...,11,2,0,0,27,0,0,0,0,0,0,505
1,Terrific purchase,"This might seem to be awkward, but this is the...",might seem awkward first led tv home flat tv n...,15,9,0,0,21,0,0,0,0,1,0,506
2,Brilliant,I must say it is best decision to by Nokia TV....,must say best decision nokia tv device feature...,8,1,0,0,19,0,0,0,0,0,0,394
3,Best in the market!,Flipcart delivered the Product in less than 24...,flipcart delivered product le hr really amazed...,7,3,0,0,30,0,0,0,0,1,0,509
4,Pretty good,"Pros1) Picture Quality is good, micro dimming ...",pro picture quality good micro dimming work we...,12,6,0,0,33,0,0,0,0,1,0,500
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
505,Best in the market!,Best Quality for nokiaREAD MORE,best quality nokiaread,0,0,0,0,2,0,0,0,0,0,0,31
506,Horrible,"Although all things good about this tv, one th...",although thing good tv one thing found display...,2,1,0,0,16,0,0,0,0,2,0,253
507,Wonderful,Very nice tv & sound is bestREAD MORE,nice tv sound bestread,1,0,0,0,3,0,0,0,0,0,0,37
508,Nice,Some bleeding on the screen. Sound is okay. Pi...,bleeding screen sound okay picture quality goo...,5,0,0,0,4,0,0,0,0,0,0,100


# **Linguistic features**

**Number of right-branching nodes**

In [None]:
import spacy

RightBranchingNodes = []
nlp = spacy.load("en_core_web_sm")
for sentence in df['After Preprocessing']:
  doc = nlp(sentence)
  try:
    # Number of rightward dependents of the first token in the parse
    RightBranchingNodes.append(doc[0].n_rights)
  except IndexError:  # An empty text yields an empty doc
    RightBranchingNodes.append('No')
df['Number of Right Branching Nodes'] = RightBranchingNodes
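
The cell above records rightward dependents for the first token only. One possible reading of "right-branching nodes across all constituent types" is to count, over the whole parse, the dependents that sit to the right of their head, do the same for the left, and take the difference as the branching index. The sketch below works on a plain list of head indices so that it runs without spaCy; the example heads are hand-assigned for illustration, not parser output, and `branching_counts` is a hypothetical helper.

```python
def branching_counts(heads):
    """heads[i] is the index of token i's head; the root points to itself.

    Returns (right, left, index): dependents to the right of their head,
    dependents to the left of their head, and right minus left.
    """
    right = sum(1 for i, h in enumerate(heads) if i > h)
    left = sum(1 for i, h in enumerate(heads) if i < h)
    return right, left, right - left

# "nice tv sound good" with hand-assigned heads (root = token 2, "sound")
tokens = ["nice", "tv", "sound", "good"]
heads = [1, 2, 2, 2]
right, left, branching_index = branching_counts(heads)
```

With spaCy, `[token.head.i for token in doc]` would produce a list in this shape, since the root token's head is the token itself.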

**Sentence Length**

In [None]:
df['Sentenece Length'] = df['Full Review'].apply(len)  # Length of the raw review in characters
df

Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing,Number of Adjectives,Number of Adverbs,Number of Cordinating Conjunctions,Number of Interjections,Number of Nouns,Number of Verbs,Number of Personal Pronouns,Number of Predeterminers,Number of Determiners,Number of Subordinating Conjuctions,Number of Right Branching Nodes,Sentenece Length
0,Review from Technology Gyan: Almost everything...,"This 55"" 4K Nokia TV at this price point comes...",k nokia tv price point come almost everything ...,11,2,0,0,27,0,0,0,0,0,0,505
1,Terrific purchase,"This might seem to be awkward, but this is the...",might seem awkward first led tv home flat tv n...,15,9,0,0,21,0,0,0,0,1,0,506
2,Brilliant,I must say it is best decision to by Nokia TV....,must say best decision nokia tv device feature...,8,1,0,0,19,0,0,0,0,0,0,394
3,Best in the market!,Flipcart delivered the Product in less than 24...,flipcart delivered product le hr really amazed...,7,3,0,0,30,0,0,0,0,1,0,509
4,Pretty good,"Pros1) Picture Quality is good, micro dimming ...",pro picture quality good micro dimming work we...,12,6,0,0,33,0,0,0,0,1,0,500
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
505,Best in the market!,Best Quality for nokiaREAD MORE,best quality nokiaread,0,0,0,0,2,0,0,0,0,0,0,31
506,Horrible,"Although all things good about this tv, one th...",although thing good tv one thing found display...,2,1,0,0,16,0,0,0,0,2,0,253
507,Wonderful,Very nice tv & sound is bestREAD MORE,nice tv sound bestread,1,0,0,0,3,0,0,0,0,0,0,37
508,Nice,Some bleeding on the screen. Sound is okay. Pi...,bleeding screen sound okay picture quality goo...,5,0,0,0,4,0,0,0,0,0,0,100
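
The `Sentenece Length` column above counts characters in the whole review. If length in tokens per sentence is wanted instead, a rough sketch is shown below, assuming that `.`, `!` and `?` end sentences (abbreviations and decimals will be split incorrectly). The review string is made up for illustration and `sentence_lengths` is a hypothetical helper.

```python
import re

def sentence_lengths(text: str):
    """Split on ., ! or ? and return the token count of each non-empty sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

review = "Delivery was quick. The picture is sharp and bright! Recommended."
lengths = sentence_lengths(review)
max_length = max(lengths) if lengths else 0
average_length = sum(lengths) / len(lengths) if lengths else 0.0
```

This has to run on the raw `Full Review` text, because the preprocessing steps remove the punctuation that marks sentence boundaries.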
