# DS 3000 Quiz 1

Due by: Tuesday Oct 9 @ 11:59 PM EST

Time Limit: You have 2 hours to complete the assignment once started

## Instructions

This quiz has 100 points total.

- You are welcome to post a private note on Piazza, but to keep a consistent testing environment for all students we are unlikely to provide assistance.
- You may not contact other students with information about this quiz
    - even saying "it was easy/hard" in a general sense can introduce a bias in favor of students who take the quiz earlier or later
- Under no circumstances should you share a copy of this quiz with anyone who isn't a member of the course staff.
- Take this quiz with open notes and feel free to access any online resource / documentation you'd like.  

### Submission Instructions
After completing the quiz, follow these steps to submit:
1. "Kernel" -> "Restart & Run All"
1. save your quiz file so that it reflects this latest run
1. upload the `.ipynb` to Gradescope **before** clicking submit
1. ensure that you can see your Jupyter notebook in the Gradescope interface after clicking "submit"

We emphasize the last step because Gradescope has allowed students to "submit" without uploading a file.  It is your responsibility to ensure that you've actually submitted a file.

### Academic Integrity Pledge

Input your name below to sign the Academic Integrity Pledge before continuing with the quiz. Failure to do so will result in a score of **0**.

In [1]:
name = 'Josue Antonio'
print(f'I, {name}, declare that the following work is entirely my own, and that I did not copy or seek help from any students who have currently or previously taken this course, nor from any online source other than private messages between myself and the professor on Piazza/via email.')

I, Josue Antonio, declare that the following work is entirely my own, and that I did not copy or seek help from any students who have currently or previously taken this course, nor from any online source other than private messages between myself and the professor on Piazza/via email.


In [2]:
# the following modules will be necessary to complete the quiz
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from datetime import datetime
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json

# Part 1: Dictionary API (50 points)

Using [this dictionary API](https://dictionaryapi.dev/) create the following dataframe by searching for the words `hello`, `data` and `science`.

Note that your searches may return multiple words, multiple definitions or multiple pronunciations.  Where necessary, always select the first.


|   |    word |                                                          url_pronounce |                                                                                                                                                                                              definition |
|--:|--------:|-----------------------------------------------------------------------:|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| 0 |   hello |     https://api.dictionaryapi.dev/media/pronunciations/en/hello-au.mp3 |                                                                                                                                                                     "Hello!" or an equivalent greeting. |
| 1 |    data |   https://api.dictionaryapi.dev/media/pronunciations/en/data-au-nz.mp3 | (plural: data) A measurement of something on a scale understood by both the recorder (a person or device) and the reader (another person or device). The scale is arbitrarily defined, such as from ... |
| 2 | science | https://api.dictionaryapi.dev/media/pronunciations/en/science-1-ca.mp3 |                                                A particular discipline or branch of learning, especially one dealing with measurable or systematic principles rather than intuition or natural ability. |

**Note:** Because each row of the pandas dataframe contains so many characters, you may find that:

    pd.options.display.max_colwidth = 200
    
allows you to see the whole thing.
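
As a minimal sketch of what this setting changes, the snippet below renders a made-up one-row frame (not quiz data) under two column widths. It uses `pd.option_context`, which applies the same `display.max_colwidth` option temporarily instead of globally.

```python
import pandas as pd

# Hypothetical one-row frame with a 120-character string, for illustration only.
long_text = "x" * 120
toy_df = pd.DataFrame({"definition": [long_text]})

# With a small max_colwidth the cell is cut off and ends in "..." when rendered.
with pd.option_context("display.max_colwidth", 20):
    short_view = repr(toy_df)

# With max_colwidth set to 200 the whole 120-character string is rendered.
with pd.option_context("display.max_colwidth", 200):
    full_view = repr(toy_df)

print(long_text in short_view)
print(long_text in full_view)
```

Strings longer than the chosen width are still truncated, which is why the `data` definition in the table above ends in `...`.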

**Note Also:** Your response need not build any functions, but be sure to name variables appropriately and document your process.
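
The "always select the first" rule can be sketched offline, without calling the API. The nested structure below is a hand-written stand-in, not real API output: the `phonetics`, `audio`, `meanings`, `definitions` and `definition` keys mirror the ones the solution indexes, while the `word` key, the URLs and the helper name are assumptions made for illustration.

```python
# Hypothetical, hand-written stand-in for one search result (not real API output).
sample_response = [
    {
        "word": "example",
        "phonetics": [
            {"audio": "https://example.com/first.mp3"},
            {"audio": "https://example.com/second.mp3"},
        ],
        "meanings": [
            {"definitions": [{"definition": "first definition"},
                             {"definition": "second definition"}]},
            {"definitions": [{"definition": "another sense"}]},
        ],
    },
    {"word": "example (second entry)", "phonetics": [], "meanings": []},
]


def first_pronunciation_and_definition(entries):
    """Return (audio url, definition), taking index 0 at every level."""
    first_entry = entries[0]
    audio_url = first_entry["phonetics"][0]["audio"]
    definition = first_entry["meanings"][0]["definitions"][0]["definition"]
    return audio_url, definition


audio_url, definition = first_pronunciation_and_definition(sample_response)
print(audio_url)
print(definition)
```

Indexing with `[0]` at each level keeps the first word, the first pronunciation and the first definition, and ignores everything else the search returned.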

In [3]:
def dict_api(word):
    """ Extract word's pronunciation url and definition from the https://dictionaryapi.dev API
        Params: word of interest (string)
        Returns: pronunciation url and definition of word (as strings)
        Desc: Formats url for given word; requests the url; parses the JSON response;
              extracts the first pronunciation url and the first definition
    """
    url = f"https://api.dictionaryapi.dev/api/v2/entries/en/{word}"
    response = requests.get(url)
    # stop early with an HTTP error if the request failed
    response.raise_for_status()
    # the API returns JSON, so parse it directly rather than passing it through BeautifulSoup
    jsn = response.json()
    # always select the first word, first pronunciation and first definition
    url_pronounce = jsn[0]['phonetics'][0]['audio']
    definition = jsn[0]['meanings'][0]['definitions'][0]['definition']
    return url_pronounce, definition

# initialize words of interest and vars
# fill urls and defs lists with data corresponding to each word of interest
words = ['hello', 'data', 'science']
urls, defs = [], []
for word in words:
    u, d = dict_api(word)
    urls.append(u)
    defs.append(d)

# initialize dict where keys are the column names (word, url_pronounce, definition) and the
# corresponding values are the lists of words, urls, and definitions
# (named dict_api_data so that it does not overwrite the dict_api function above)
dict_api_data = {'word': words,
                 'url_pronounce': urls,
                 'definition': defs}

# construct df from dict and display it
# widen the column display to 200 characters so more of each long definition is visible
dict_api_df = pd.DataFrame.from_dict(dict_api_data)
pd.options.display.max_colwidth = 200
dict_api_df

Unnamed: 0,word,url_pronounce,definition
0,hello,https://api.dictionaryapi.dev/media/pronunciations/en/hello-au.mp3,"""Hello!"" or an equivalent greeting."
1,data,https://api.dictionaryapi.dev/media/pronunciations/en/data-au-nz.mp3,"(plural: data) A measurement of something on a scale understood by both the recorder (a person or device) and the reader (another person or device). The scale is arbitrarily defined, such as from ..."
2,science,https://api.dictionaryapi.dev/media/pronunciations/en/science-1-ca.mp3,"A particular discipline or branch of learning, especially one dealing with measurable or systematic principles rather than intuition or natural ability."


# Part 2: Web Scraping Korean Dramas (50 points)

Your goal is to build a data frame that includes two columns, `category` and `movie`, based on the 50 best Korean Dramas according to [this website](https://www.marieclaire.com/culture/a26895105/best-korean-dramas/). To help you, the actual web scraping part of this problem is done in the first code cell below, along with the first step of cleaning the data. The result is a list of headers from the website. Note:

- certain elements in the list are the categories
- all elements trailing a category belong to that category until a new category appears

**Note:** the below are directions for one way to accomplish the task. If you can think of a faster way to do it, please do so!

Create two empty lists, `kdramas_cats` and `kdramas_movs`. Then, loop through the `headers` and build out the lists such that the `kdramas_cats` contains the category corresponding to each movie and `kdramas_movs` contains all the movies. Then, use these lists to create a data frame with a column called `category` (where the `kdramas_cats` data are stored) and a column called `movie` (where the `kdramas_movs` data are stored). Make sure you clean the data so that:

- the categories do not have the `' Korean Dramas'` part of their string
- the last element of `headers` is not included (since it is not a movie, but rather an advertisement that happened to share the `<h2>` tag)
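
One way to sketch the steps above is to remember the most recent category while looping. The header strings below are made up for illustration, and the sketch assumes (as the solution further down does) that movie titles are wrapped in single quotes while category headers are not.

```python
import pandas as pd

# Hypothetical header strings standing in for the scraped <h2> text.
toy_headers = [
    "Action Korean Dramas",
    "'Title A'",
    "'Title B'",
    "Romance Korean Dramas",
    "'Title C'",
    "Some advertisement",
]

categories, movies = [], []
current_category = None

# Skip the last element, which stands in for the advertisement.
for text in toy_headers[:-1]:
    if text.startswith("'"):
        # A quoted header is a movie: pair it with the latest category seen.
        movies.append(text.strip("'"))
        categories.append(current_category)
    else:
        # Any other header starts a new category.
        current_category = text.replace(" Korean Dramas", "")

toy_df = pd.DataFrame({"category": categories, "movie": movies})
print(toy_df)
```

Because a category is appended once per movie, the two lists always have the same length and line up row by row in the data frame.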

When you are done, print the entire data frame to ensure it all worked.

**Note:** Your response need not build any functions, but be sure to name variables appropriately and document your process.

In [4]:
# the url, scraper, and soup object
url = 'https://www.marieclaire.com/culture/a26895105/best-korean-dramas/'
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')

# parsing the data to get only the headers
headers = soup.find_all('h2')

# initializing vars
kdramas_cats, kdramas_movs = [], []
current_cat = None

# loop through every header except the last one (an advertisement that shares the <h2> tag)
for header in headers[:-1]:
    if header.text.startswith("'"):
        # movie titles start with " ' ": strip the quotes and pair the movie with the
        # most recently seen category
        kdramas_movs.append(header.text.strip("'"))
        kdramas_cats.append(current_cat)
    else:
        # any other header is a category: clean it up by removing the 'Korean Dramas' part
        current_cat = header.text.replace('Korean Dramas', '').strip()

# initialize dict where keys are the column names and the corresponding values are the matching
# lists of categories and movies for the top 50 kdramas
kdramas_dict = {'category': kdramas_cats, 'movie': kdramas_movs}

# construct df from dict and display it
kdramas_df = pd.DataFrame.from_dict(kdramas_dict)
kdramas_df


Unnamed: 0,category,movie
0,Action/Thriller,Squid Game
1,Action/Thriller,Vincenzo
2,Action/Thriller,Happiness
3,Action/Thriller,All of Us Are Dead
4,Action/Thriller,My Name
5,Action/Thriller,D.P
6,Action/Thriller,Weak Hero Class 1
7,Action/Thriller,Bloodhounds
8,Romance,Crash Landing on You
9,Romance,Business Proposal
