<a href="https://colab.research.google.com/github/navyaravi/INFO_5731_Spring_2021/blob/main/In_class_exercise_05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The fifth In-class-exercise (2/23/2021, 20 points in total)

In exercise-03, I asked you to collect 500 text records based on your own information needs (if you didn't collect the textual data, you should collect it again for this exercise). Now we need to think about how to represent the textual data for text classification. In this exercise, you are required to select 10 types of features (10 feature types, which will yield far more than 10 individual features) from the following feature list, then represent the 500 texts with these features. The output should be in the following format:
![image.png](attachment:image.png)

The feature list:

* (1) tf-idf features
* (2) POS-tag features: number of adjective, adverb, auxiliary, punctuation, complementizer, coordinating conjunction, subordinating conjunction, determiner, interjection, noun, possessor, preposition, pronoun, quantifier, verb, and other. (select some of them if you use pos-tag features)
* (3) Linguistic features:
  * number of right-branching nodes across all constituent types
  * number of right-branching nodes for NPs only
  * number of left-branching nodes across all constituent types
  * number of left-branching nodes for NPs only
  * number of premodifiers across all constituent types
  * number of premodifiers within NPs only
  * number of postmodifiers across all constituent types
  * number of postmodifiers within NPs only
  * branching index across all constituent types, i.e. the number of right-branching nodes minus number of left-branching nodes
  * branching index for NPs only
  * branching weight index: number of tokens covered by right-branching nodes minus number of tokens covered by left-branching nodes across all categories
  * branching weight index for NPs only 
  * modification index, i.e. the number of premodifiers minus the number of postmodifiers across all categories
  * modification index for NPs only
  * modification weight index: length in tokens of all premodifiers minus length in tokens of all postmodifiers across all categories
  * modification weight index for NPs only
  * coordination balance, i.e. the maximal length difference in coordinated constituents
  
  * density (density can be calculated as the ratio of the following function words to content words) of determiners/quantifiers
  * density of pronouns
  * density of prepositions
  * density of punctuation marks, specifically commas and semicolons
  * density of auxiliary verbs
  * density of conjunctions
  * density of different pronoun types: Wh, 1st, 2nd, and 3rd person pronouns
  
  * maximal and average NP length
  * maximal and average AJP length
  * maximal and average PP length
  * maximal and average AVP length
  * sentence length

* Other features you have in mind (e.g., pre-defined patterns)
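
The density and index definitions in the list reduce to simple arithmetic once the counts are available. A minimal sketch, assuming the tokens have already been tagged with Penn Treebank tags (the kind `nltk.pos_tag` produces). The tag groupings `CONTENT_PREFIXES` and `FUNCTION_TAGS` and both helper names are illustrative choices, not part of the assignment; in particular, treating only `MD` as auxiliary is a simplification.

```python
# Assumed grouping of Penn Treebank tags; adjust as needed.
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")  # nouns, verbs, adjectives, adverbs
FUNCTION_TAGS = {
    "determiner": {"DT", "PDT", "WDT"},
    "pronoun": {"PRP", "PRP$", "WP", "WP$"},
    "preposition": {"IN", "TO"},
    "conjunction": {"CC"},
    "auxiliary": {"MD"},
}

def densities(tagged):
    """Ratio of each function-word category to the number of content words."""
    tags = [tag for _, tag in tagged]
    content = sum(tag.startswith(CONTENT_PREFIXES) for tag in tags)
    result = {}
    for name, members in FUNCTION_TAGS.items():
        count = sum(tag in members for tag in tags)
        # Guard against texts with no content words.
        result[name] = count / content if content else 0.0
    return result

def branching_index(n_right, n_left):
    """Right-branching nodes minus left-branching nodes, as defined in the list."""
    return n_right - n_left

tagged = [("the", "DT"), ("phone", "NN"), ("is", "VBZ"), ("good", "JJ"),
          ("for", "IN"), ("me", "PRP")]
print(densities(tagged))
print(branching_index(n_right=5, n_left=2))
```

The modification index follows the same pattern, with premodifier and postmodifier counts in place of the branching counts.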

In [None]:
# Please write your code here

**Extracting Data**

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
main_text = [] # List to store Review headings
sub_text =[] #List to store reviews
for number in range(52):
  link = "https://www.flipkart.com/realme-7-mist-white-64-gb/product-reviews/itme55d08631f19b?pid=MOBFUYUNDSH7NMVT&lid=LSTMOBFUYUNDSH7NMVTWLABPJ&marketplace=FLIPKART&page=" + str(number) # Generating link dynamically
  page = requests.get(link) # Accessing the webpage
  soup = BeautifulSoup(page.text, 'html.parser')
  main_reviews = soup.find_all(class_='_2-N8zT') # Getting the Review Heading by using the class name
  text_reviews = soup.find_all(class_='t-ZTKy') # Getting the full reviews by using the class name
  for ele, sub_ele in zip(main_reviews, text_reviews) : # Iterating through the list
      main_text.append(ele.text) #Appending to empty list
      sub_text.append(sub_ele.text)
df = pd.DataFrame(list(zip(main_text, sub_text)), columns =['Glimpse of Review', 'Full Review'])  # Creating Dataframe
print("Length of data frame is {0}".format(len(df)))
df

Length of data frame is 510


Unnamed: 0,Glimpse of Review,Full Review
0,Good quality product,Good value for money and a phone with good cam...
1,Must buy!,best camera and display and best performance i...
2,Worth the money,Writting this after using for 1 month.Pros.1. ...
3,Wonderful,When we talk about a brand loyal person....so ...
4,Mind-blowing purchase,Best phone in the range Camera is better ✔️Bat...
...,...,...
505,Excellent,GoodREAD MORE
506,Classy product,awesomeREAD MORE
507,Must buy!,HappyREAD MORE
508,Just wow!,Nice phoneREAD MORE
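
The short reviews in the table end with the literal text `READ MORE`, which appears to come from the page's expand link rather than from the reviewer. A small sketch of stripping it before preprocessing; the helper name `strip_read_more` is hypothetical.

```python
import re

def strip_read_more(text):
    """Remove a trailing 'READ MORE' marker left over from the review page."""
    return re.sub(r"\s*READ MORE\s*$", "", text)

samples = ["GoodREAD MORE", "Nice phoneREAD MORE", "No marker here"]
print([strip_read_more(s) for s in samples])
# ['Good', 'Nice phone', 'No marker here']
```

Applied to the data frame, this could look like `df['Full Review'] = df['Full Review'].apply(strip_read_more)`.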


**Preprocessing Data**

**Converting to lower case**

In [2]:
df['After Preprocessing'] = df['Full Review'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df

Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing
0,Good quality product,Good value for money and a phone with good cam...,good value for money and a phone with good cam...
1,Must buy!,best camera and display and best performance i...,best camera and display and best performance i...
2,Worth the money,Writting this after using for 1 month.Pros.1. ...,writting this after using for 1 month.pros.1. ...
3,Wonderful,When we talk about a brand loyal person....so ...,when we talk about a brand loyal person....so ...
4,Mind-blowing purchase,Best phone in the range Camera is better ✔️Bat...,best phone in the range camera is better ✔️bat...
...,...,...,...
505,Excellent,GoodREAD MORE,goodread more
506,Classy product,awesomeREAD MORE,awesomeread more
507,Must buy!,HappyREAD MORE,happyread more
508,Just wow!,Nice phoneREAD MORE,nice phoneread more


**Removing Punctuation**

In [3]:

df['After Preprocessing'] = df['After Preprocessing'].str.replace(r'[^\w\s]', '', regex=True)  # regex=True is required on recent pandas versions
df

Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing
0,Good quality product,Good value for money and a phone with good cam...,good value for money and a phone with good cam...
1,Must buy!,best camera and display and best performance i...,best camera and display and best performance i...
2,Worth the money,Writting this after using for 1 month.Pros.1. ...,writting this after using for 1 monthpros1 cam...
3,Wonderful,When we talk about a brand loyal person....so ...,when we talk about a brand loyal personso here...
4,Mind-blowing purchase,Best phone in the range Camera is better ✔️Bat...,best phone in the range camera is better batte...
...,...,...,...
505,Excellent,GoodREAD MORE,goodread more
506,Classy product,awesomeREAD MORE,awesomeread more
507,Must buy!,HappyREAD MORE,happyread more
508,Just wow!,Nice phoneREAD MORE,nice phoneread more
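
Deleting punctuation outright fuses words that were separated only by a full stop, as in row 2 above, where `1 month.pros.1.` becomes `1 monthpros1`. One possible alternative, shown here as a sketch rather than as the notebook's method, replaces each run of punctuation with a space and then collapses the whitespace. The helper name `remove_punctuation` is hypothetical.

```python
import re

def remove_punctuation(text):
    """Replace runs of punctuation with a single space, then tidy whitespace."""
    spaced = re.sub(r"[^\w\s]+", " ", text)
    return " ".join(spaced.split())

print(remove_punctuation("using for 1 month.pros.1. camera"))
# using for 1 month pros 1 camera
```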


**Removing Numerics**

In [4]:
df['After Preprocessing'] = df['After Preprocessing'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))

**Removing Special Characters**

In [5]:
import re
df['After Preprocessing'] = df['After Preprocessing'].apply(lambda x: re.sub(r"[^a-zA-Z0-9]", ' ', x))  # replace every non-alphanumeric character with a space
df

Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing
0,Good quality product,Good value for money and a phone with good cam...,good value for money and a phone with good cam...
1,Must buy!,best camera and display and best performance i...,best camera and display and best performance i...
2,Worth the money,Writting this after using for 1 month.Pros.1. ...,writting this after using for monthpros camer...
3,Wonderful,When we talk about a brand loyal person....so ...,when we talk about a brand loyal personso here...
4,Mind-blowing purchase,Best phone in the range Camera is better ✔️Bat...,best phone in the range camera is better batte...
...,...,...,...
505,Excellent,GoodREAD MORE,goodread more
506,Classy product,awesomeREAD MORE,awesomeread more
507,Must buy!,HappyREAD MORE,happyread more
508,Just wow!,Nice phoneREAD MORE,nice phoneread more


In [6]:
import nltk
# Download only the resources the cells below use (newer NLTK releases may also need 'punkt_tab' and 'averaged_perceptron_tagger_eng')
for package in ['stopwords', 'punkt', 'averaged_perceptron_tagger']:
  nltk.download(package)

**Removing Stop Words**

In [7]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
df['After Preprocessing'] = df['After Preprocessing'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df

Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing
0,Good quality product,Good value for money and a phone with good cam...,good value money phone good camera great batte...
1,Must buy!,best camera and display and best performance i...,best camera display best performance segment e...
2,Worth the money,Writting this after using for 1 month.Pros.1. ...,writting using monthpros camera good price ran...
3,Wonderful,When we talk about a brand loyal person....so ...,talk brand loyal personso amit pleasure say th...
4,Mind-blowing purchase,Best phone in the range Camera is better ✔️Bat...,best phone range camera better battery backup ...
...,...,...,...
505,Excellent,GoodREAD MORE,goodread
506,Classy product,awesomeREAD MORE,awesomeread
507,Must buy!,HappyREAD MORE,happyread
508,Just wow!,Nice phoneREAD MORE,nice phoneread


**Spelling Correction**

In [8]:
from textblob import TextBlob
df['After Preprocessing'].apply(lambda x: str(TextBlob(x).correct()))

0      good value money phone good camera great batte...
1      best camera display best performance segment e...
2      writing using monthpros camera good price rang...
3      talk brand loyal persons amid pleasure say tha...
4      best phone range camera better battery back ok...
                             ...                        
505                                             goodread
506                                          awesomeread
507                                            happyread
508                                       nice phoneread
509                                      happy buy tread
Name: After Preprocessing, Length: 510, dtype: object

**Stemming**

In [9]:
from nltk.stem import PorterStemmer
st = PorterStemmer()
df['After Preprocessing'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

0      good valu money phone good camera great batter...
1      best camera display best perform segment excel...
2      writ use monthpro camera good price rang compa...
3      talk brand loyal personso amit pleasur say tha...
4      best phone rang camera better batteri backup o...
                             ...                        
505                                             goodread
506                                          awesomeread
507                                            happyread
508                                       nice phoneread
509                                     happi buy itread
Name: After Preprocessing, Length: 510, dtype: object

**Lemmatization**

In [10]:
from textblob import Word
import nltk
nltk.download('wordnet')

df['After Preprocessing'] = df['After Preprocessing'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing
0,Good quality product,Good value for money and a phone with good cam...,good value money phone good camera great batte...
1,Must buy!,best camera and display and best performance i...,best camera display best performance segment e...
2,Worth the money,Writting this after using for 1 month.Pros.1. ...,writting using monthpros camera good price ran...
3,Wonderful,When we talk about a brand loyal person....so ...,talk brand loyal personso amit pleasure say th...
4,Mind-blowing purchase,Best phone in the range Camera is better ✔️Bat...,best phone range camera better battery backup ...
...,...,...,...
505,Excellent,GoodREAD MORE,goodread
506,Classy product,awesomeREAD MORE,awesomeread
507,Must buy!,HappyREAD MORE,happyread
508,Just wow!,Nice phoneREAD MORE,nice phoneread
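
With the cleaned text in `After Preprocessing`, the first feature type on the list, tf-idf, can be computed per review. The sketch below uses only the standard library and the textbook weighting tf × log(N / df). That choice of variant is an assumption; libraries such as scikit-learn apply a smoothed, normalised version that gives different numbers. The sample `docs` stand in for the column values, and `tfidf` is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df_counts = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / df_counts[term])
            for term, count in tf.items()
        })
    return weights

docs = ["good value money phone", "best camera display", "good camera"]
for row in tfidf(docs):
    print(row)
```

A term that occurs in every document gets weight 0 under this formula, which is the intended effect: it carries no information for telling documents apart.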


**Parts of Speech Tagging and Features**

In [11]:
from nltk.tokenize import word_tokenize
pos = []
for sentence in df['After Preprocessing']:
  text = word_tokenize(sentence)
  pos.append(nltk.pos_tag(text))
pos

[[('good', 'JJ'),
  ('value', 'NN'),
  ('money', 'NN'),
  ('phone', 'NN'),
  ('good', 'JJ'),
  ('camera', 'NN'),
  ('great', 'JJ'),
  ('battery', 'NN'),
  ('life', 'NN'),
  ('fast', 'VBD'),
  ('charging', 'VBG'),
  ('life', 'NN'),
  ('saver', 'NN'),
  ('sometimesread', 'NN')],
 [('best', 'JJS'),
  ('camera', 'NN'),
  ('display', 'NN'),
  ('best', 'JJS'),
  ('performance', 'NN'),
  ('segment', 'NN'),
  ('excellent', 'JJ'),
  ('gaming', 'VBG'),
  ('pubg', 'NN'),
  ('experiece', 'NN'),
  ('perfect', 'JJ'),
  ('g', 'NN'),
  ('processor', 'NN'),
  ('give', 'VBP'),
  ('best', 'JJS'),
  ('performanceread', 'NN')],
 [('writting', 'VBG'),
  ('using', 'VBG'),
  ('monthpros', 'NNS'),
  ('camera', 'RB'),
  ('good', 'JJ'),
  ('price', 'NN'),
  ('range', 'NN'),
  ('compact', 'JJ'),
  ('design', 'NN'),
  ('easily', 'RB'),
  ('usable', 'JJ'),
  ('one', 'CD'),
  ('hand', 'NN'),
  ('heavy', 'JJ'),
  ('bloatware', 'NN'),
  ('installed', 'VBN'),
  ('realme', 'NN'),
  ('apps', 'NN'),
  ('battery', 'NN'),
 

In [12]:
Adjective = []
Adverb = []
CordinatingConjunction = []
SubordinatingConjuction = []
Interjection = []
Noun = []
Verb = []
PersonalPronoun = []
predeterminer = []
Determiner = []

In [13]:
for value in pos:
  AdjectiveCount = 0
  AdverbCount = 0
  CordinatingConjunctionCount = 0
  SubordinatingConjuctionCount = 0
  InterjectionCount = 0
  NounCount = 0
  VerbCount = 0
  PersonalPronounCount = 0
  predeterminerCount = 0
  DeterminerCount = 0
  for word,tag in value:
    if tag == 'JJ':
      AdjectiveCount = AdjectiveCount + 1
    elif tag == 'RB':
      AdverbCount = AdverbCount + 1
    elif tag == 'CC':
      CordinatingConjunctionCount = CordinatingConjunctionCount + 1
    elif tag == 'UH':
      InterjectionCount = InterjectionCount + 1
    elif tag == 'NN':
      NounCount = NounCount + 1
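    elif tag == 'VR':  # 'VR' is not a Penn Treebank tag (verb tags are VB, VBD, VBG, VBN, VBP, VBZ), so this count stays 0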
      VerbCount = VerbCount + 1
    elif tag == 'PRP':
      PersonalPronounCount = PersonalPronounCount + 1
    elif tag == 'PDT':
      predeterminerCount = predeterminerCount + 1
    elif tag == 'DT':
      DeterminerCount = DeterminerCount + 1
    elif tag == 'IN':  # 'IN' covers prepositions as well as subordinating conjunctions
      SubordinatingConjuctionCount = SubordinatingConjuctionCount + 1
  Adjective.append(AdjectiveCount)
  Adverb.append(AdverbCount)
  CordinatingConjunction.append(CordinatingConjunctionCount)
  Interjection.append(InterjectionCount)
  Noun.append(NounCount)
  Verb.append(VerbCount)
  PersonalPronoun.append(PersonalPronounCount)
  predeterminer.append(predeterminerCount)
  Determiner.append(DeterminerCount)
  SubordinatingConjuction.append(SubordinatingConjuctionCount)

In [14]:
df['Number of Adjectives'] = Adjective
df['Number of Adverbs'] = Adverb
df['Number of Cordinating Conjunctions'] = CordinatingConjunction
df['Number of Interjections'] = Interjection
df['Number of Nouns'] = Noun
df['Number of Verbs'] = Verb
df['Number of Personal Pronouns'] = PersonalPronoun
df['Number of Predeterminers'] = predeterminer
df['Number of Determiners'] = Determiner
df['Number of Subordinating Conjuctions'] = SubordinatingConjuction
df

Unnamed: 0,Glimpse of Review,Full Review,After Preprocessing,Number of Adjectives,Number of Adverbs,Number of Cordinating Conjunctions,Number of Interjections,Number of Nouns,Number of Verbs,Number of Personal Pronouns,Number of Predeterminers,Number of Determiners,Number of Subordinating Conjuctions
0,Good quality product,Good value for money and a phone with good cam...,good value money phone good camera great batte...,3,0,0,0,9,0,0,0,0,0
1,Must buy!,best camera and display and best performance i...,best camera display best performance segment e...,2,0,0,0,9,0,0,0,0,0
2,Worth the money,Writting this after using for 1 month.Pros.1. ...,writting using monthpros camera good price ran...,10,5,0,0,20,0,0,0,0,0
3,Wonderful,When we talk about a brand loyal person....so ...,talk brand loyal personso amit pleasure say th...,5,3,0,0,26,0,0,0,0,4
4,Mind-blowing purchase,Best phone in the range Camera is better ✔️Bat...,best phone range camera better battery backup ...,2,0,0,0,9,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
505,Excellent,GoodREAD MORE,goodread,0,0,0,0,1,0,0,0,0,0
506,Classy product,awesomeREAD MORE,awesomeread,0,0,0,0,1,0,0,0,0,0
507,Must buy!,HappyREAD MORE,happyread,0,0,0,0,1,0,0,0,0,0
508,Just wow!,Nice phoneREAD MORE,nice phoneread,1,0,0,0,1,0,0,0,0,0
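
Verb tags in the Penn Treebank set all begin with `VB` (`VBD`, `VBG`, and so on), and adjectives and nouns likewise come in variants such as `JJS` and `NNS` that exact matches on `JJ` and `NN` skip. Counting by tag prefix catches all of them. A minimal sketch; the `PREFIXES` mapping and `count_pos` are illustrative names, and the sample tokens are taken from the tagger output shown earlier.

```python
# Illustrative mapping from feature name to Penn Treebank tag prefix.
PREFIXES = {"adjective": "JJ", "adverb": "RB", "noun": "NN", "verb": "VB"}

def count_pos(tagged):
    """Count tokens whose tag starts with each prefix in PREFIXES."""
    counts = dict.fromkeys(PREFIXES, 0)
    for _, tag in tagged:
        for name, prefix in PREFIXES.items():
            if tag.startswith(prefix):
                counts[name] += 1
    return counts

sample = [("good", "JJ"), ("value", "NN"), ("battery", "NN"),
          ("fast", "VBD"), ("charging", "VBG"), ("best", "JJS")]
print(count_pos(sample))
# {'adjective': 2, 'adverb': 0, 'noun': 2, 'verb': 2}
```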


**Linguistic features**

**Number of right-branching nodes**

In [11]:
%pip install -U spacy
!python -m spacy download en_core_web_sm



In [9]:
import spacy

nlp = spacy.load("en_core_web_sm")
RightBranchingNodes = []
for sentence in df['After Preprocessing']:
  doc = nlp(sentence)
  try:
    # n_rights is the number of rightward dependency children of the first token
    RightBranchingNodes.append(doc[0].n_rights)
  except IndexError:
    # Empty review: there is no first token
    RightBranchingNodes.append('No')
df['Number of Right Branching Nodes'] = RightBranchingNodes
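
The cell above looks only at the first token's right-hand dependency children. One way to approximate a review-level count, offered as a sketch, is to compare every token's position with its head's position: a token that follows its head is a right dependent, and one that precedes it is a left dependent. Using dependency arcs as a stand-in for the constituent nodes named in the feature list is an assumption. The `heads` list below is hand-written; with spaCy it could be built as `[tok.head.i for tok in doc]`. The helper name `branching_counts` is hypothetical.

```python
def branching_counts(heads):
    """Count right and left dependents given each token's head index.

    heads[i] is the index of token i's head; the root points to itself.
    """
    right = sum(1 for i, h in enumerate(heads) if i > h)
    left = sum(1 for i, h in enumerate(heads) if i < h)
    return right, left

# "good phone works well": good -> phone, phone -> works, works = root, well -> works
heads = [1, 2, 2, 2]
right, left = branching_counts(heads)
print(right, left, right - left)  # right dependents, left dependents, branching index
# 1 2 -1
```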

**Sentence Length**

In [None]:
df['Sentence Length'] = df['Full Review'].apply(len)  # length of the raw review in characters
df
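
The cell above measures each review's length in characters. If sentence length is meant in tokens, a rough sketch is to split on sentence-final punctuation and count whitespace-separated words. The regular expression is a simplification that will mis-split abbreviations, and `sentence_lengths` is a hypothetical helper.

```python
import re

def sentence_lengths(text):
    """Return the number of whitespace-separated tokens in each sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

lengths = sentence_lengths("Best phone in the range. Camera is better! Worth it")
print(lengths, max(lengths), sum(lengths) / len(lengths))
# [5, 3, 2] 5 3.3333333333333335
```

This needs the raw `Full Review` text, since the preprocessed column no longer contains sentence punctuation.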