## The fifth in-class exercise (2/23/2021, 20 points in total)

In exercise-03, I asked you to collect 500 texts based on your own information needs (if you didn't collect the textual data, you should collect it again for this exercise). Now we need to think about how to represent the textual data for text classification. In this exercise, you are required to select 10 types of features from the feature list below (10 types of features, which will certainly amount to more than 10 individual features), then represent the 500 texts with these features. The output should be in the following format:
![image.png](attachment:image.png)

The feature list:

* (1) tf-idf features
* (2) POS-tag features: number of adjective, adverb, auxiliary, punctuation, complementizer, coordinating conjunction, subordinating conjunction, determiner, interjection, noun, possessor, preposition, pronoun, quantifier, verb, and other. (select some of them if you use pos-tag features)
* (3) Linguistic features:
  * number of right-branching nodes across all constituent types
  * number of right-branching nodes for NPs only
  * number of left-branching nodes across all constituent types
  * number of left-branching nodes for NPs only
  * number of premodifiers across all constituent types
  * number of premodifiers within NPs only
  * number of postmodifiers across all constituent types
  * number of postmodifiers within NPs only
  * branching index across all constituent types, i.e. the number of right-branching nodes minus number of left-branching nodes
  * branching index for NPs only
  * branching weight index: number of tokens covered by right-branching nodes minus number of tokens covered by left-branching nodes across all categories
  * branching weight index for NPs only 
  * modification index, i.e. the number of premodifiers minus the number of postmodifiers across all categories
  * modification index for NPs only
  * modification weight index: length in tokens of all premodifiers minus length in tokens of all postmodifiers across all categories
  * modification weight index for NPs only
  * coordination balance, i.e. the maximal length difference in coordinated constituents
  
  * density of determiners/quantifiers (density can be calculated as the ratio of the following function words to content words)
  * density of pronouns
  * density of prepositions
  * density of punctuation marks, specifically commas and semicolons
  * density of auxiliary verbs
  * density of conjunctions
  * density of different pronoun types: Wh, 1st, 2nd, and 3rd person pronouns
  
  * maximal and average NP length
  * maximal and average AJP length
  * maximal and average PP length
  * maximal and average AVP length
  * sentence length

* Other features you have in mind (e.g., pre-defined patterns)
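
The density features in the list above are ratios of function-word counts to content-word counts. A minimal sketch of that calculation follows, assuming the tokens are already tagged with Penn Treebank tags. The function name `density_features` and the tag groupings are illustrative assumptions, not part of the exercise (for example, `IN` also covers subordinating conjunctions, and `MD` covers modal auxiliaries only).

```python
from collections import Counter

# Assumption: Penn Treebank tags, grouped here for illustration only.
FUNCTION_TAGS = {
    "determiner": {"DT", "PDT", "WDT"},
    "pronoun": {"PRP", "PRP$", "WP", "WP$"},
    "preposition": {"IN"},
    "conjunction": {"CC"},
    "auxiliary": {"MD"},
}
# Assumption: nouns, verbs, adjectives, and adverbs count as content words.
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")


def density_features(tagged_tokens):
    """Return {class name: function-word count / content-word count}."""
    tag_counts = Counter(tag for _, tag in tagged_tokens)
    content_total = sum(
        count for tag, count in tag_counts.items()
        if tag.startswith(CONTENT_PREFIXES)
    )
    densities = {}
    for name, tags in FUNCTION_TAGS.items():
        function_total = sum(tag_counts[tag] for tag in tags)
        # Guard against texts that contain no content words
        densities[name] = function_total / content_total if content_total else 0.0
    return densities


# Hand-tagged example sentence (hypothetical input)
example = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"),
           ("the", "DT"), ("mat", "NN"), ("and", "CC"), ("it", "PRP"),
           ("slept", "VBD")]
print(density_features(example))
```

The example has four content words and two determiners, so the determiner density is 0.5.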

In [12]:
# Your code here (Please add comments in the code):

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import json

def extractData(data): # Extracts the headline and body of each article from HTML
  soup = BeautifulSoup(data, 'html.parser')
  headline_list = []
  body_list = []
  # Each article sits in its own "news-card" div
  all_news = soup.find_all("div", {"class": "news-card z-depth-1"})
  for each_news in all_news:
    article_body = each_news.find("div", {"itemprop" : "articleBody"}).get_text()
    headline = each_news.find("span", {"itemprop" : "headline"}).get_text()
    headline_list.append(headline)
    body_list.append(article_body)

  return headline_list, body_list

def get_headers(): # The headers for the http request
    return {
        "accept": "*/*",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-IN,en-US;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
        "cookie": "_ga=GA1.2.474379061.1548476083; _gid=GA1.2.251903072.1548476083; __gads=ID=17fd29a6d34048fc:T=1548476085:S=ALNI_MaRiLYBFlMfKNMAtiW0J3b_o0XGxw",
        "origin": "https://inshorts.com",
        "referer": "https://inshorts.com/en/read/",
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
        "x-requested-with": "XMLHttpRequest"
    }

title_text = [] # List to store News Titles
content_text =[] #List to store News Content

temp_title = []
temp_content = []

url = "https://www.inshorts.com/en/read"
request = requests.get(url)
title_text, content_text = extractData(request.text)  # Extract articles in First page - 25 articles

regex_min_news_id = re.compile('var min_news_id = "(.*?)";')
min_news_id = regex_min_news_id.search(request.text).group(1)

for number in range(475): # Send up to 475 further requests for more articles
  ajax_url = 'https://inshorts.com/en/ajax/more_news'
  response = requests.post(ajax_url, data={"category": "", "news_offset": min_news_id}, headers=get_headers())
  try:
    response_json = json.loads(response.text)
  except ValueError:  # Stop as soon as a response is not valid JSON
    break
  temp_title, temp_content = extractData(response_json["html"])
  # append stores each page's list as a single nested element;
  # the nested lists are joined into strings in a later cell
  title_text.append(temp_title)
  content_text.append(temp_content)
  min_news_id = response_json["min_news_id"]  # Offset for the next request

df = pd.DataFrame(content_text, columns =['Article Content'])  # Creating Dataframe
df

Unnamed: 0,Article Content
0,"A lifelong promise provides care, trust and se..."
1,Congress on Saturday announced its first list ...
2,The BJP has announced its first list of 57 can...
3,The Congress has released its first list of 40...
4,"India reported 18,711 fresh cases of COVID-19 ..."
...,...
198,[At least 11 people were killed and 36 others ...
199,"[A puppy named 'Cyclops', who was born with on..."
200,[A lake has formed above Raini village in Utta...
201,[In her reply to the Budget discussions in the...
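
Rows 198 onward in the output above are wrapped in brackets because `list.append` adds each page's list of articles as one nested element. The small sketch below contrasts `append` with `extend`; the article strings are made-up stand-ins for the values returned by `extractData`.

```python
# Hypothetical stand-ins for the lists returned by extractData:
# one list for the first page, one list per later AJAX response.
first_page = ["article 1", "article 2"]
later_pages = [["article 3", "article 4"], ["article 5"]]

nested = list(first_page)
flat = list(first_page)
for page in later_pages:
    nested.append(page)  # adds the whole page as ONE element (a list)
    flat.extend(page)    # adds each article as its own element

print(len(nested))  # 4: two strings plus two nested lists
print(len(flat))    # 5: one string per article
```

With `extend`, every row of the resulting DataFrame would hold a single article string, so no later flattening step would be needed.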


In [18]:
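# Join any nested list of article strings into a single string.
# Assigning to `row` inside iterrows() is not guaranteed to modify df, so use apply instead.
df['Article Content'] = df['Article Content'].apply(
    lambda content: "".join(content) if isinstance(content, list) else content)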

df

Unnamed: 0,Article Content
0,"A lifelong promise provides care, trust and se..."
1,Congress on Saturday announced its first list ...
2,The BJP has announced its first list of 57 can...
3,The Congress has released its first list of 40...
4,"India reported 18,711 fresh cases of COVID-19 ..."
...,...
198,At least 11 people were killed and 36 others i...
199,"A puppy named 'Cyclops', who was born with one..."
200,A lake has formed above Raini village in Uttar...
201,In her reply to the Budget discussions in the ...


In [13]:
#data cleaning and pre processing
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
words=stopwords.words('english')
from textblob import Word
from textblob import TextBlob
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import PorterStemmer
st=PorterStemmer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [21]:
# To lower_case
df['Article Content']=df['Article Content'].apply(lambda x:" ".join(x.lower() for x in x.split()))

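# remove punctuation
# (only drops whitespace-separated tokens that are punctuation on their own;
# punctuation attached to a word is kept)
df['Article Content']=df['Article Content'].apply(lambda  x: " ".join(x for x in x.split() if x not in string.punctuation))

# remove special characters
# NOTE: str.replace treats its first argument as literal text, not a regex,
# so this call leaves the text unchanged
df['Article Content']=df['Article Content'].apply(lambda x:" ".join(x.replace('[#,@,&,!,$,^,*]', '') for x in x.split()))

# remove stop words
df['Article Content']=df['Article Content'].apply(lambda x:" ".join(x for x in x.split() if x not in words))

# remove numbers
# NOTE: as above, str.replace looks for the literal text '\d+', so digits are kept
df['Article Content']=df['Article Content'].apply(lambda x:" ".join(x.replace(r'\d+', '') for x in x.split()))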

# tokenize
df['Article Content']=df['Article Content'].apply(lambda x: TextBlob(x).words)

# stemming
df['Article Content']=df['Article Content'].apply(lambda x: " ".join([st.stem(word) for word in x]))

# lemmatization
df['Article Content']=df['Article Content'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

df

Unnamed: 0,Article Content
0,lifelong promis provid care trust secur say ca...
1,congress saturday announc first list 13 candid...
2,bjp announc first list 57 candid first two pha...
3,congress releas first list 40 candid upcom thr...
4,"india report 18,711 fresh case covid-19 100 de..."
...,...
198,least 11 peopl kill 36 other injur fire firecr...
199,puppi name 'cyclop born one eye two tongu nose...
200,lake form raini villag uttarakhand 's chamoli ...
201,repli budget discus parliament financ minist n...
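
The special-character and number steps above pass regex-style patterns to `str.replace`, which matches literal text only; that is why `18,711` and `covid-19` survive in the output. A minimal sketch of one way to strip them with `re.sub` and `str.translate` follows. The function name `clean_text` and the decision to drop every punctuation character are assumptions for illustration, not the pipeline used in this notebook.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Lowercase the text and strip digits and punctuation (illustrative only)."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)   # replace runs of digits with a space
    text = text.translate(PUNCT_TABLE)  # delete punctuation characters
    return " ".join(text.split())       # collapse repeated whitespace


print(clean_text("India reported 18,711 fresh cases of COVID-19!"))
# prints: india reported fresh cases of covid
```

Whether digits should be removed at all depends on the classification task, since numbers such as case counts can be informative.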


**TF-IDF Features**

In [36]:
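from collections import Counter

def term_frequency(text):
    # Relative frequency of each term in one document.
    # Note: counts are divided by the number of distinct terms,
    # not by the total number of tokens in the document.
    counts = Counter(text.split())
    return {term: round(count / len(counts), 3) for term, count in counts.items()}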

In [37]:
df['tf'] = df['Article Content'].apply(term_frequency)
df['tf']

0      {'lifelong': 0.062, 'promis': 0.062, 'provid':...
1      {'congress': 0.059, 'saturday': 0.029, 'announ...
2      {'bjp': 0.061, 'announc': 0.03, 'first': 0.061...
3      {'congress': 0.059, 'releas': 0.029, 'first': ...
4      {'india': 0.091, 'report': 0.03, '18,711': 0.0...
                             ...                        
198    {'least': 0.001, '11': 0.001, 'peopl': 0.007, ...
199    {'puppi': 0.004, 'name': 0.004, ''cyclop': 0.0...
200    {'lake': 0.009, 'form': 0.002, 'raini': 0.002,...
201    {'repli': 0.001, 'budget': 0.003, 'discus': 0....
202    {'ahead': 0.001, 'assembl': 0.001, 'elect': 0....
Name: tf, Length: 203, dtype: object

In [70]:
from collections import Counter
from math import log

def inverse_document_frequency(array):
  # Split every document into words once
  docs = [row.split(' ') for row in array]
  # Document frequency: number of documents that contain each word
  doc_freq = Counter(word for words in docs for word in set(words))
  n_docs = len(docs)
  # For each document, map each of its words to log(N / document frequency)
  return [{word: log(n_docs / doc_freq[word]) for word in words} for words in docs]

In [71]:
df['idf'] = inverse_document_frequency(df['Article Content'].values)
df['idf']

0      {'lifelong': 5.313205979041787, 'promis': 1.91...
1      {'congress': 0.6688150799004147, 'saturday': 1...
2      {'bjp': 0.5682738506785373, 'announc': 0.40793...
3      {'congress': 0.6688150799004147, 'releas': 0.5...
4      {'india': 0.1314224287497022, 'report': 0.1774...
                             ...                        
198    {'least': 0.9564971523521957, '11': 1.48456458...
199    {'puppi': 4.214593690373678, 'name': 0.6688150...
200    {'lake': 2.8282993292537872, 'form': 1.3813803...
201    {'repli': 2.2221635256834715, 'budget': 1.5290...
202    {'ahead': 1.186071593996696, 'assembl': 0.6882...
Name: idf, Length: 203, dtype: object

In [72]:
def tf_idf(array1, array2):
    tf_idf_by_row = []
    for d1, d2 in zip(array1, array2):
        d = {}
        for key in d1.keys():
            d[key] = d1[key]*d2[key]
        tf_idf_by_row.append(d)
    return tf_idf_by_row

In [73]:
df['tf_idf'] = tf_idf(df['tf'].values, df['idf'].values)
df['tf_idf']

0      {'lifelong': 0.3294187707005908, 'promis': 0.1...
1      {'congress': 0.039460089714124465, 'saturday':...
2      {'bjp': 0.03466470489139078, 'announc': 0.0122...
3      {'congress': 0.039460089714124465, 'releas': 0...
4      {'india': 0.0119594410162229, 'report': 0.0053...
                             ...                        
198    {'least': 0.0009564971523521957, '11': 0.00148...
199    {'puppi': 0.016858374761494713, 'name': 0.0026...
200    {'lake': 0.025454693963284085, 'form': 0.00276...
201    {'repli': 0.0022221635256834717, 'budget': 0.0...
202    {'ahead': 0.001186071593996696, 'assembl': 0.0...
Name: tf_idf, Length: 203, dtype: object
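
To make the arithmetic behind these scores concrete, here is a small worked sketch on a toy corpus (the three documents are invented, not taken from the scraped data). It uses the same `log(N / document frequency)` weighting as `inverse_document_frequency`, but normalizes term counts by the total number of tokens, which is an assumption that differs from `term_frequency` above.

```python
from collections import Counter
from math import log

# Toy corpus (hypothetical)
docs = ["budget tax budget", "tax election", "election rally election"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tf_idf_scores(tokens):
    """Return {word: tf * idf} for one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)  # assumption: normalize by total tokens
    return {word: (count / total) * log(n_docs / doc_freq[word])
            for word, count in counts.items()}


scores = [tf_idf_scores(tokens) for tokens in tokenized]
print(scores[0])
```

In the first document, `budget` occurs twice and appears in no other document, so it scores (2/3) × log(3/1), while `tax` occurs once and appears in two documents, so it scores (1/3) × log(3/2).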

**POS Tag Features**

In [45]:
# Tagging and Counting POS
nltk.download('averaged_perceptron_tagger')

# Re-tokenize the cleaned text
df['Tokens']=df['Article Content'].apply(lambda x: TextBlob(x).words)

# Tag every document with Penn Treebank POS tags.
# The text is already lowercased, stemmed, and stripped of stop words,
# so the tagger sees little context and some tags may be unreliable.
pos = [nltk.pos_tag(tokens) for tokens in df['Tokens']]
pos

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[[('lifelong', 'JJ'),
  ('promis', 'NN'),
  ('provid', 'NN'),
  ('care', 'NN'),
  ('trust', 'NN'),
  ('secur', 'NNS'),
  ('say', 'VBP'),
  ('canara', 'JJ'),
  ('hsbc', 'NN'),
  ('obc', 'JJ'),
  ('life', 'NN'),
  ('make', 'VBP'),
  ('respons', 'NNS'),
  ('lifelong', 'JJ'),
  ('promis', 'JJ'),
  ('iselect', 'NN'),
  ('star', 'NN'),
  ('term', 'NN'),
  ('plan', 'NN'),
  ('offer', 'VBP'),
  ('spous', 'JJ'),
  ('cover', 'NN'),
  ('ad', 'NN'),
  ('product', 'NN'),
  ('also', 'RB'),
  ('offer', 'VBP'),
  ('100', 'CD'),
  ('return', 'NN'),
  ('premium', 'NN'),
  ('option', 'NN'),
  ('save', 'VBP'),
  ('35', 'CD'),
  ('premium', 'JJ'),
  ('limit', 'NN'),
  ('premium', 'JJ'),
  ('payment', 'NN'),
  ('option', 'NN'),
  ('said', 'VBD')],
 [('congress', 'NN'),
  ('saturday', 'NN'),
  ('announc', 'IN'),
  ('first', 'JJ'),
  ('list', 'NN'),
  ('13', 'CD'),
  ('candid', 'JJ'),
  ('first', 'RB'),
  ('two', 'CD'),
  ('phase', 'NN'),
  ('west', 'JJS'),
  ('bengal', 'JJ'),
  ('assembl', 'NNS'),
  ('elect'

**Linguistic Features**

In [74]:
import numpy as np

# Penn Treebank tags counted per document, in output-column order.
# Only exact matches are counted, so e.g. 'NNS' or 'VBD' are not included.
TAGS = ['NN', 'DT', 'VB', 'CC', 'PRP', 'POS', 'RB', 'JJ', 'IN', 'PDT']

def calculate_features(tag_pos):
  # One row per document, one column per tag in TAGS
  counts = np.zeros((len(tag_pos), len(TAGS)))
  for row, tagged_doc in enumerate(tag_pos):
    for _, tag in tagged_doc:
      if tag in TAGS:
        counts[row, TAGS.index(tag)] += 1
  # Return one list of per-document counts for each tag
  return tuple(counts[:, col].tolist() for col in range(len(TAGS)))

count_pos=calculate_features(pos)

In [75]:
import pandas as pd

# One row per document, one column per counted tag
zippedlist=list(zip(*count_pos))
df_total = pd.DataFrame(zippedlist, columns=['Nouns','Determiner','Verbs','Coordinating conjunction','Personal pronoun','Possessive ending','Adverb','Adjective','Preposition','Predeterminer'])
df_total.head(5)

Unnamed: 0,Nouns,Determiner,Verbs,Coordinating conjunction,Personal pronoun,Possessive ending,Adverb,Adjective,Preposition,Predeterminer
0,19.0,0.0,0.0,0.0,0.0,0.0,1.0,8.0,0.0,0.0
1,19.0,0.0,0.0,0.0,0.0,0.0,1.0,6.0,1.0,0.0
2,18.0,0.0,0.0,0.0,0.0,0.0,1.0,8.0,0.0,0.0
3,25.0,0.0,0.0,0.0,0.0,0.0,1.0,8.0,1.0,0.0
4,20.0,0.0,0.0,0.0,0.0,2.0,0.0,10.0,0.0,0.0
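
The counts above match exact tags only, so plural nouns (`NNS`) and inflected verbs (`VBP`, `VBD`) are not included, which helps explain the near-zero `Verbs` column. One possible alternative is to group Penn Treebank tags by their two-letter prefix. The sketch below assumes that grouping; the mapping and the function name `coarse_pos_counts` are illustrative, and the sample pairs are taken from the tagger output shown earlier.

```python
from collections import Counter

# Assumed mapping from Penn Treebank tag prefixes to coarse classes
COARSE_PREFIXES = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}


def coarse_pos_counts(tagged_tokens):
    """Count tokens per coarse class, using the first two letters of each tag."""
    counts = Counter()
    for _, tag in tagged_tokens:
        counts[COARSE_PREFIXES.get(tag[:2], "other")] += 1
    return counts


sample = [("lifelong", "JJ"), ("promis", "NN"), ("secur", "NNS"),
          ("say", "VBP"), ("said", "VBD"), ("also", "RB"), ("100", "CD")]
print(coarse_pos_counts(sample))
```

Here `NN` and `NNS` both count as nouns and `VBP` and `VBD` both count as verbs, while `CD` falls into the `other` bucket.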


In [76]:
df_tokens=df['Tokens']
df_tokens.head(5)

0    [lifelong, promis, provid, care, trust, secur,...
1    [congress, saturday, announc, first, list, 13,...
2    [bjp, announc, first, list, 57, candid, first,...
3    [congress, releas, first, list, 40, candid, up...
4    [india, report, 18,711, fresh, case, covid-19,...
Name: Tokens, dtype: object

In [77]:
final_df = pd.concat([df_tokens, df_total],ignore_index=True, sort=False,axis=1)
final_df.columns=['text','Nouns','Determiner','Verbs','Coordinating conjunction','Personal pronoun','Possessive ending','Adverb','Adjective','Preposition','Predeterminer']
final_df.head(13)

Unnamed: 0,text,Nouns,Determiner,Verbs,Coordinating conjunction,Personal pronoun,Possessive ending,Adverb,Adjective,Preposition,Predeterminer
0,"[lifelong, promis, provid, care, trust, secur,...",19.0,0.0,0.0,0.0,0.0,0.0,1.0,8.0,0.0,0.0
1,"[congress, saturday, announc, first, list, 13,...",19.0,0.0,0.0,0.0,0.0,0.0,1.0,6.0,1.0,0.0
2,"[bjp, announc, first, list, 57, candid, first,...",18.0,0.0,0.0,0.0,0.0,0.0,1.0,8.0,0.0,0.0
3,"[congress, releas, first, list, 40, candid, up...",25.0,0.0,0.0,0.0,0.0,0.0,1.0,8.0,1.0,0.0
4,"[india, report, 18,711, fresh, case, covid-19,...",20.0,0.0,0.0,0.0,0.0,2.0,0.0,10.0,0.0,0.0
5,"[india, captain, virat, kohli, saturday, said,...",19.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,1.0,0.0
6,"[india, all-round, washington, sundar, 's, fat...",17.0,0.0,0.0,0.0,1.0,1.0,2.0,8.0,1.0,0.0
7,"[ex-india, open, virend, sehwag, took, instagr...",20.0,0.0,1.0,0.0,0.0,0.0,1.0,9.0,0.0,0.0
8,"[england, woman, cricket, alexandra, hartley, ...",30.0,0.0,1.0,1.0,0.0,2.0,0.0,5.0,0.0,0.0
9,"[former, india, batsman, sachin, tendulkar, to...",15.0,1.0,0.0,0.0,1.0,1.0,1.0,9.0,1.0,0.0
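
The table above holds only the POS counts, while the tf-idf scores are stored as one dictionary per document. A minimal sketch of how such dictionaries could be turned into fixed-length numeric rows, so that they can sit next to the other feature columns, is shown below. The helper name `dicts_to_matrix` and the example weights are hypothetical.

```python
def dicts_to_matrix(feature_dicts):
    """Turn a list of {term: weight} dicts into (vocabulary, rows)."""
    # Shared, sorted vocabulary across all documents
    vocabulary = sorted({term for d in feature_dicts for term in d})
    # One row per document; terms missing from a document get 0.0
    rows = [[d.get(term, 0.0) for term in vocabulary] for d in feature_dicts]
    return vocabulary, rows


# Made-up weights for two documents
tfidf_dicts = [{"budget": 0.73, "tax": 0.14}, {"tax": 0.14, "election": 0.2}]
vocabulary, rows = dicts_to_matrix(tfidf_dicts)
print(vocabulary)
for row in rows:
    print(row)
```

With pandas, `pd.DataFrame(rows, columns=vocabulary)` would give one column per term, which could then be concatenated with the POS-count table.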
