# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**


# Question 1 (40 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [None]:
# Your code here
#Densho Digital Repository

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from urllib.error import HTTPError
import pandas as pd
import time
from google.colab import drive

start_time = time.time()

target_url = "https://ddr.densho.org/narrators/?page={}"
num_page = 40
narrator = ""

narrator_name = []
narrator_desc = []

for i in range(1, num_page):
  #open URL
  try:
    link = Request(target_url.format(i), headers={'User-Agent': 'Mozilla/5.0'})
    url = urlopen(link)

    data = url.read()
    retrieved_data = BeautifulSoup(data)

    #get narrator details
    narrator = retrieved_data.find_all("a", attrs={'class':'item-hover'})
  except ConnectionError as e:
    print(e)
  except Exception as e:
    print(e)

  for j in narrator:
    #open another request to get the narrators details
    try:
      profile_url = Request(j['href'], headers={'User-Agent': 'Mozilla/5.0'})
      profile_data = BeautifulSoup(urlopen(profile_url).read())

      narrator_detail = profile_data.find_all("div", attrs={'class':'col-sm-8 col-md-8'})

      for k in narrator_detail:
        print(k.h1.text)
        narrator_name.append(k.h1.text)
        narrator_desc.append(k.p.text)

        time.sleep(0.5)

    except ConnectionError as e:
      print(e)
    except Exception as e:
      print(e)

  dataframe = pd.DataFrame({"name": narrator_name, "desc": narrator_desc})

drive.mount('drive', force_remount=True)
dataframe.to_csv('/content/drive/My Drive/densho_11700380.csv', encoding='utf-8')

print("Completed in --- %s seconds ---" % (time.time() - start_time))


  Kay Aiko Abe


  Art Abe


  Sharon Tanagi Aburano


  Toshiko Aiboshi


  Douglas L. Aihara


  Yae Aihara


  Elaine Akagi


  Nelson Takeo Akagi


  Tom Akashi


  Mas Akiyama


  Sab Akiyama


  Sumie Suguro Akizuki


  Harry Akune


  Kenjiro Akune


  Gene Akutsu


  Jim Akutsu


  Dorothy Almojuela


  Alice Matsumoto Ando


  Emery Brooks Andrews


  Margie Nahmias Angel


  Dan Aoki


  Nancy K. Araki


  Sam Araki


  Sakaye Aratani


  Terry Aratani


  Yoshiko Asakura


  Setsuko Izumi Asano


  George Azumano


  Peggie Nishimura Bain


  Sarah Baker


  Dennis Bambauer


  Kathryn Bannai


  Lorraine Bannai


  Paul Bannai


  Yone Bartholomew


  Gerald L. Beppu


  June Yasuno Aochi (Yamashiro) Berk


  Marion Michiko Bernardo


  Angela Berry


  Bob Berry


  Ernest Besig


  Theo Bickel


  Kazuko Uno Bill


  Mabel Shoji Boggs


  David R. Boyd


  Robert "Bob" Bratt


  Bill Braye


  Nikki Bridges


  Harold "Hal" Champeness


  Connie Thorson Chandler


  Joan

# Question 2 (30 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [5]:
# Write code for each of the sub parts with proper comments.
import pandas as pd
import io
from google.colab import drive
from nltk.corpus import stopwords
import nltk
import string
import re

nltk.download('stopwords')
nltk.download('punkt')

#load the csv back to Colab
drive.mount('drive', force_remount=True)

dataframe = pd.read_csv('/content/drive/My Drive/densho_11700380.csv')

#combine the name and the desc
dataframe['desc'] = dataframe['name'] +" "+ dataframe['desc']

#1 remove noise special characters and punctuations
dataframe['desc'] = dataframe['desc'].str.replace("[^A-Za-z]+"," ")

#2 remove number
dataframe['desc'] = dataframe['desc'].str.replace("[0-9]+"," ")

#3 remove stopwords
stop = stopwords.words('english')
dataframe['desc'] = dataframe['desc'].apply(lambda x: " ".join(x for x in str(x).split() if x not in stop))

#4 lowercase all texts
dataframe['desc'] = dataframe['desc'].str.lower()

#5 stemming
from nltk.stem import PorterStemmer
st = PorterStemmer()
dataframe['desc'] = dataframe['desc'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

#6 lemmatization
from textblob import Word
import nltk
nltk.download('wordnet')

dataframe['desc'] = dataframe['desc'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

dataframe.to_csv('/content/drive/My Drive/densho_11700380_cleaned.csv', encoding='utf-8')




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Mounted at drive


  dataframe['desc'] = dataframe['desc'].str.replace("[^A-Za-z]+"," ")
  dataframe['desc'] = dataframe['desc'].str.replace("[0-9]+"," ")
[nltk_data] Downloading package wordnet to /root/nltk_data...


# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [7]:
# Your code here
#1 POS tagging
!pip install stanza
import pandas as pd
import nltk
import stanza
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

import spacy

from collections import Counter

from google.colab import drive

nltk.download('averaged_perceptron_tagger')

#combine all rows to one single string
big_string = " ".join(dataframe['desc'])

print(Counter([j for i,j in pos_tag(word_tokenize(big_string))]))

nlp = spacy.load('en_core_web_sm')


#2 Constituency Parsing and Dependency Parsing
#Constituency Parsing
print("------Constituency Parsing------")

const = stanza.Pipeline(lang='en', processors='tokenize,pos,constituency')

for sents in dataframe['desc']:
  doc = const(sents)

  for sentence in doc.sentences:
    print(sentence.constituency)

#Dependency Parsing
print("------Dependency Parsing------")
for sents in dataframe['desc']:
  doc_1 = nlp(sents)

  for chunk in doc_1.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)


#3 Named Entity Recognition
print("------Named Entity Recognition------")
text = []
label = []

doc_2 = nlp(big_string)

for ent in doc_2.ents:
  text.append(ent.text)
  label.append(ent.label_)

entities = pd.DataFrame({"text": text, "label": label})
entities = entities.groupby(['text','label'])['text'].count()

entities.head()




[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Counter({'NN': 31883, 'JJ': 9272, 'VBD': 1746, 'IN': 1626, 'VBP': 1607, 'VBN': 937, 'NNS': 899, 'RB': 814, 'VB': 561, 'VBZ': 268, 'CD': 245, 'FW': 217, 'DT': 147, 'PRP': 123, 'MD': 103, 'NNP': 91, 'JJR': 66, 'JJS': 62, 'CC': 52, 'RBR': 34, 'WRB': 27, 'RP': 20, 'PRP$': 14, 'VBG': 14, 'WP': 7, 'WP$': 5, 'RBS': 4, 'WDT': 2, 'TO': 2, 'PDT': 1, 'UH': 1})


INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


------Constituency Parsing------


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor    | Package             |
--------------------------------------
| tokenize     | combined            |
| mwt          | combined            |
| pos          | combined_charlm     |
| constituency | ptb3-revised_charlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Loading: constituency
INFO:stanza:Done loading processors!


(ROOT (S (NP (NML (NNP kay) (NN aiko)) (NNP abe) (NNP nisei) (NNP femal) (VBN born)) (VP (MD may) (S (NP (NNP selleck) (SBAR (S (NP (NNP washington)) (VP (VBD spent) (NP (JJ much) (NN childhood) (NML (NML (NNP beaverton) (NNP oregon) (NN father)) (JJ own)) (NML (NN farm) (NN influenc)) (JJ earli) (NN age) (NN parent)))))) (VP (VP (VBZ convers)) (NP (NML (JJ christian) (NNP dure)) (NN world) (NN war)) (ADVP (NNP ii)) (VP (VB remov) (NP (NML (NNP portland) (NN assembl) (NN center)) (NNP oregon) (NNP minidoka) (NN concentr) (NN camp) (NNP idaho)) (SBAR (IN after) (S (NP (NN war) (NN work)) (VP (VBP establish) (NP (NP (NN success) (NN volunt) (NN program)) (VP (NN feed) (NP (JJ homeless) (NN seattl) (NNP washington)))))))))))))
(ROOT (S (NP (NML (NN art) (NNP abe)) (NNP nisei) (ADJP (JJ male) (VBN born)) (NML (NNP june) (NNP seattl)) (NML (NNP washington))) (VP (VBD grew) (NP (NP (ADJP (NP (NN area) (NN seattl) (FW japanes)) (JJ american) (VB attend)) (NN univers) (NNP washington) (NML (NM

KeyboardInterrupt: 

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [None]:
# Write your response below
'''
I have a love-and-hate relationship with the API. I tried to use the Semantic Scholar API, but their server was not working
and kept sending me 504 and 505. In the end, I decided to scrap the Densho repository instead.

I enjoyed working with the scraper code, but I was lost when diving deeper into the NLP things.
I needed to grasp the concept first before doing the code, especially the Constituency Parsing and Dependency Parsing
I could do the code, but hardly understood what the codes were really about.

The completion time for this assignment was about right from my perspective.

However, would the sentences lose meaning if we clean the sentences (for example, removing stopwords)
and perform the Constituency Parsing or Dependency Parsing?

'''

'\nI have a love-and-hate relationship with the API. I tried to use the Semantic Scholar API, but their server was not working. \nI kept getting 504 and 505. In the end, I decided to scrap the Densho repository instead. \n\nI enjoyed working with the scraper code, but I was lost when diving deeper into the NLP things. \nI needed to grasp the concept first before doing the code. \nI could do the code, but I hardly understood what the codes were\xa0really\xa0about.\n\nThe completion time for this assignment was about right from my\xa0perspective.\n\n'