# DS 3000 Take-Home Coding Exam 

In [1]:
name = input()
print(f'I, {name}, declare that the following work is entirely my own, and that I did not copy or seek help from any students who have currently or previously taken this course, nor from any online source (incl. AI) other than private messages between myself and the professor on Piazza/via email.')

Tejadatta Kalapatapu
I, Tejadatta Kalapatapu, declare that the following work is entirely my own, and that I did not copy or seek help from any students who have currently or previously taken this course, nor from any online source (incl. AI) other than private messages between myself and the professor on Piazza/via email.


Due: Sunday Oct 13 @ 11:59 PM EST

**Please block out about 100 minutes of time to complete this exam**

### Submission Instructions
Submit this `ipynb` file to Gradescope (this can also be done via the assignment on Canvas). To ensure that your submitted `ipynb` file represents your latest code, make sure to give a fresh `Kernel > Restart & Run All` just before uploading the `ipynb` file to Gradescope. **In addition:**
- "Sign" the academic integrity pledge above with your name
- Make sure you comment your code effectively
- If problems are difficult for the TAs/Profs to grade, you will lose points

### Note: This is an **EXAM**
While I encourage some dialogue (*never* sharing of answers) on most assignments, this is an **EXAM**. You should treat it as such; there is **NO COLLABORATION** allowed. If anyone is found to have copied any of their answers on this exam from any source, they will receive a 0, be reported to OSCCR, and potentially fail the course. Also:
- You are welcome to post a private note on Piazza, but to keep a consistent testing environment for all students we are unlikely to provide assistance.
- You may not contact other students with information about this quiz
    - even saying "it was easy/hard" in a general sense can introduce a bias in favor of students who take the quiz earlier or later
- Under no circumstances should you share a copy of this quiz with anyone who isn't a member of the course staff.

### Tips for success
- Start early
- Remember to use cells and headings to make the notebook easy to read (if a grader cannot find the answer to a problem, you will receive no points for it)
- Under no circumstances may one student view or share their ungraded homework or quiz with another student [(see also)](http://www.northeastern.edu/osccr/academic-integrity)


In [2]:
# the following modules may be helpful in completing the exam
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from datetime import datetime
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json

# Web Scraping and Cleaning Korean Dramas

Your goal is to build a data frame that includes three columns: `category`, `movie/show`, and `year` based on the 70 best Korean Dramas according to [this website](https://www.marieclaire.com/culture/a26895105/best-korean-dramas/).

## Part 1 (25 points)

Use the `requests` module and the `BeautifulSoup` function to scrape the website, then study the html to determine what tags you should use to refine your soup. **Hint:** there should be a single tag that you can pass to `soup.find_all()` to grab a list containing all the information we need.

In [3]:
link = "https://www.marieclaire.com/culture/a26895105/best-korean-dramas/"
response = requests.get(link)
soup = BeautifulSoup(response.content, 'html.parser')

In [4]:
# collect every <h2> tag; these headers hold both the category names and the show titles
headers = soup.find_all('h2')
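
To make the shape of the `find_all` result concrete, here is a minimal sketch on a small inline HTML string. The markup and titles in `sample_html` are made up for illustration and are not the page's actual structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page
sample_html = """
<h2>Romance Korean Dramas</h2>
<p>Some description.</p>
<h2>'Example Show' (2020)</h2>
<h3>Not matched</h3>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find_all returns the matching Tag objects in document order
sample_headers = sample_soup.find_all('h2')

# .text gives the tag's text content; strip() trims surrounding whitespace
header_texts = [tag.text.strip() for tag in sample_headers]
print(header_texts)
```

Only the two `h2` tags are returned; the `p` and `h3` tags are ignored.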

## Part 2 (25 points)

Create two empty lists, `categories` and `movies_shows`. Use the list of tags you generated from Part 1 to fill those lists out with strings such that they are the same length and that each movie in `movies_shows` is matched with its correct category in `categories`.

In [5]:
categories = []
movies_shows = []

current_category = ""
for header in headers:
    text = header.text.strip()
    #a header containing 'Korean Dramas' starts a new category; any other header is stored as a movie/show under the current category
    if 'Korean Dramas' in text:
        current_category = text
    else:
        categories.append(current_category)
        movies_shows.append(text)
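
The pairing logic can be checked on a small hand-made list before trusting it on the scraped headers. This is a sketch only: the strings in `sample_texts` and the helper name `pair_with_category` are hypothetical, not taken from the page or the exam.

```python
# Hypothetical header strings, in the order they would appear on a page
sample_texts = [
    "Romance Korean Dramas",
    "'Show A' (2019)",
    "'Show B' (2021)",
    "Fantasy Korean Dramas",
    "'Show C' (2016-2017)",
]

def pair_with_category(texts, marker="Korean Dramas"):
    """Pair each non-category header with the most recent category header."""
    cats, titles = [], []
    current = ""
    for text in texts:
        if marker in text:
            current = text  # category header: remember it, do not store it
        else:
            cats.append(current)  # one category entry per title keeps the lists aligned
            titles.append(text)
    return cats, titles

sample_cats, sample_titles = pair_with_category(sample_texts)
print(list(zip(sample_cats, sample_titles)))
```

Because a category is appended every time a title is appended, the two lists always have the same length. Any header that appears before the first category header would be paired with an empty string.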

## Part 3 (25 points)

Notice that the strings contained in the `categories` and `movies_shows` lists you created in Part 2 are still somewhat messy, and that we have yet to create the `year` list. Make sure you clean the data so that:

- The categories do not have the `' Korean Dramas'` part of their string.
- The years are extracted from the movies/shows and stored in their own list as integers.
    - **Note 1:** If you extracted the movies and shows the way I did, the third movie/show `Stay In The Know` is not actually a movie or a show, but rather the newsletter signup header. It does not have a date, and should be removed and its corresponding category also removed. Check out the [list.pop function docs](https://docs.python.org/3/tutorial/datastructures.html), but be careful with it!
    - **Note 2:** some of the actual shows are still going on; only grab the **first** year for each movie/show.
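
As a rough illustration of why `list.pop` needs care with two parallel lists, the sketch below removes the same position from both. The list contents are hypothetical stand-ins, not the scraped data.

```python
# Hypothetical parallel lists; one title is not a real show
demo_categories = ["Romance", "Romance", "Romance", "Fantasy"]
demo_titles = ["'Show A' (2019)", "'Show B' (2021)", "Stay In The Know", "'Show C' (2016)"]

# Look the position up instead of hard-coding it, then pop the same
# index from both lists so they stay aligned.
idx = demo_titles.index("Stay In The Know")
removed_title = demo_titles.pop(idx)
removed_category = demo_categories.pop(idx)

print(removed_title, removed_category)
print(len(demo_titles) == len(demo_categories))
```

`pop` mutates the list in place, so rerunning a cell that pops a hard-coded index would silently remove a real show the second time. With the lookup above, a rerun raises a `ValueError` instead because the entry is already gone.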

In [6]:
import re

def clean_category(category):
    """
    Removes the ' Korean Dramas' suffix from the category name.
    
    Arg:
        category (str): The original, raw category name
        
    Returns:
        The cleaned category name as a string.
    """
    return category.replace(' Korean Dramas', '').strip()

def extract_year(text):
    """
    Uses the `re` module to find a parenthesized year or year range in a string and returns the first
    year as an integer, or None if no year is found.
    
    Arg:
        text (str): string that has the year info
    
    Returns:
        The year as an integer or 'None'.
    """
    match = re.search(r'\((\d{4}(?:-\d{4})?)\)', text)  # matches "(YYYY)" or "(YYYY-YYYY)"
    if match:
        year_str = match.group(1)
        return int(year_str.split('-')[0])  # keep only the first year of a range
    else:
        return None

cleaned_categories = []
cleaned_movies_shows = []
yrs = []

for category, item in zip(categories, movies_shows):
    if item != 'Stay In The Know': #making sure 'Stay In The Know' is excluded
        year = extract_year(item)
        if year is not None:  #skip any header without a parseable year
            cleaned_categories.append(clean_category(category))
            cleaned_movies_shows.append(item.split('(')[0].strip())
            yrs.append(year)
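
One thing worth checking is how ongoing shows are written on the page. The pattern above only accepts `(YYYY)` or `(YYYY-YYYY)` with a plain hyphen, so a title written differently would return `None` and be dropped from all three lists. The sketch below is a more tolerant alternative that captures only the first four-digit year after an opening parenthesis. The helper name `first_year` and the sample strings are hypothetical, and whether the page actually uses these formats is an assumption that has not been verified here.

```python
import re

# Matches "(" and captures the first four-digit year; whatever follows
# (a hyphen, an en dash, "present", a closing parenthesis) is ignored.
FIRST_YEAR = re.compile(r'\((\d{4})')

def first_year(text):
    """Return the first parenthesized four-digit year as an int, or None."""
    match = FIRST_YEAR.search(text)
    return int(match.group(1)) if match else None

# Hypothetical title strings, not taken from the page
samples = [
    "'Show A' (2019)",
    "'Show B' (2016-2017)",
    "'Show C' (2021–present)",
    "'Show D' (2022-)",
    "Stay In The Know",
]
years = [first_year(s) for s in samples]
print(years)
```

Comparing the length of the cleaned lists against the number of shows expected from the article is a quick way to see whether any titles were dropped.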

## Part 4 (25 points)

Put your three cleaned lists (`categories`, `movies_shows`, and `year`) into a data frame and output the `.head()`. Then, calculate the average year for each category. Discuss, in a markdown cell:
1. if it seems that certain categories are represented by more recent movies/shows than others (explicitly reference the numerical summary in your answer)
2. what other features might be useful to collect about the movies and shows in order to better understand what makes them popular

In [7]:
#creating the dataframe with the 3 specified columns
df = pd.DataFrame({
    'Category': cleaned_categories,
    'Movie/Show': cleaned_movies_shows,
    'Year': yrs
})
print(df.head())

          Category           Movie/Show  Year
0  Action/Thriller          'Happiness'  2021
1  Action/Thriller       'Pyramid Game'  2024
2  Action/Thriller             'Signal'  2016
3          Romance  'Business Proposal'  2022
4          Romance      'Coffee Prince'  2007


In [8]:
#finding the average years for each category
avg_yrs = df.groupby('Category')['Year'].mean().sort_values(ascending=False)
print("Average year for each category:")
print(avg_yrs)

Average year for each category:
Category
Historical         2021.200000
Action/Thriller    2020.333333
Professional       2020.333333
Melodrama          2020.200000
Slice of Life      2019.222222
Fantasy            2018.666667
Romance            2016.222222
Name: Year, dtype: float64
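
An average is easier to interpret next to the number of titles behind it. As a sketch, `agg` can report both in one table; the rows in `demo_df` are hypothetical and only illustrate the call.

```python
import pandas as pd

# Hypothetical rows, not the scraped data
demo_df = pd.DataFrame({
    'Category': ['Romance', 'Romance', 'Fantasy', 'Fantasy', 'Fantasy'],
    'Year': [2007, 2022, 2016, 2019, 2021],
})

# Mean year and number of titles per category, most recent average first
summary = (
    demo_df.groupby('Category')['Year']
    .agg(['mean', 'count'])
    .sort_values('mean', ascending=False)
)
print(summary)
```

A category with only a few titles can have its average moved by a single old or new entry, which the `count` column makes visible.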


1. Yes, some categories skew more recent than others. Historical has the most recent average year (2021.2), followed by Action/Thriller and Professional (both 2020.33) and Melodrama (2020.2). Romance has the oldest average (2016.22), about five years behind Historical. This may suggest that the categories considered trendiest have shifted over time, although these averages describe only this list's picks rather than the industry as a whole.

2. Other features that could help explain what makes the movies/shows popular include the cast, the director, ratings from reviewers, the budget, the episode/season count, any awards or nominations, its trendiness on social media, and which services it is available on.