# Which country watches the most K-drama?

I will be using two datasets: 1) Netflix global top 10 rankings, available to download from their [official website](https://www.netflix.com/tudum/top10/), and 2) data scraped from [MyDramaList.com](https://mydramalist.com/search?adv=titles&ty=68,77,86&co=3&st=1&so=relevance).

In [389]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import altair as alt

### MyDramaList Scrape Data

First, scraping. I'm pulling a list of K-drama titles from MyDramaList.com:

In [208]:
page = 1  # start with the first results page to test the request
url = f"https://mydramalist.com/search?adv=titles&ty=68,83,86&co=3&st=3&so=top&page={page}"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')

In [209]:
# I'm making some functions so that it's easier to extract the info that I need from what I'm scraping.
def find_year(string):
    matches = re.findall(r'(\d{4})', string)
    year = int(matches[0]) if matches else None
    return year

def find_rank(string):
    rank = int(re.search(r'\d+', string).group())
    return rank
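
To check the helpers without hitting the site, they can be run on sample strings. This is a minimal sketch that restates both helpers with a guard for the no-match case (the original `find_rank` raises an `AttributeError` when the string has no digits). The sample strings are made up for illustration and only assumed to resemble the text on a MyDramaList card.

```python
import re

def find_year(string):
    """Return the first four-digit number in the string, or None."""
    match = re.search(r'\d{4}', string)
    return int(match.group()) if match else None

def find_rank(string):
    """Return the first run of digits in the string, or None."""
    match = re.search(r'\d+', string)
    return int(match.group()) if match else None

# Hypothetical card text, not copied from the site
sample_meta = "Korean Drama - 2020, 16 episodes"
sample_rank = "#12"

year = find_year(sample_meta)
rank = find_rank(sample_rank)
missing = find_year("no digits here")
print(year, rank, missing)
```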

In [210]:
cards = soup.select_one(".b-primary")

In [387]:
# I need a list of all K-drama titles that this database has, 
# so I'm going to iterate through all the pages of the MyDramaList database to collect a title master list.

kdrama_titles=[]
page = 1

while page <= 250:
    url = f"https://mydramalist.com/search?adv=titles&ty=68,83,86&co=3&st=3&so=top&page={page}"
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    page_cards = soup.select(".b-primary .title a")
    
    page_titles = list(filter(None, [t.text.strip() for t in page_cards]))
    kdrama_titles = kdrama_titles + page_titles
    
    page += 1


In [412]:
kdrama_titles.remove('Friends')
kdrama_titles.remove('Suits')
kdrama_titles.remove('The Empress')
kdrama_titles.remove('Lucifer')
print(kdrama_titles)



Turns out, there are a few K-dramas that share a name with popular non-Korean TV shows, like "Friends," "Suits," or "Lucifer." Unfortunately, there's no way to tell from the Netflix data whether a row is the K-drama version of a show, so I'm just going to remove these titles from the list altogether.
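
One caveat with `list.remove`: it deletes only the first match and raises a `ValueError` if the title isn't in the list. A slightly more forgiving way to do the same cleanup is to filter against a set. This is just a sketch; the helper name `drop_ambiguous` and the sample list are made up for illustration.

```python
# Titles shared with non-Korean shows (the same four removed in the cell above)
AMBIGUOUS_TITLES = {"Friends", "Suits", "The Empress", "Lucifer"}

def drop_ambiguous(titles, ambiguous=AMBIGUOUS_TITLES):
    """Return a new list with every occurrence of an ambiguous title removed."""
    return [t for t in titles if t not in ambiguous]

# Made-up sample list; "Friends" appears twice on purpose
sample_titles = ["Reply 1988", "Friends", "Vincenzo", "Suits", "Friends"]
cleaned = drop_ambiguous(sample_titles)
print(cleaned)
```

Because it builds a new list, this can be re-run safely even if a title was already removed.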

Now I'm going to scrape the years to see the trend in the number of K-dramas and TV shows that have aired over the years.

In [369]:
# List of years

kdrama_years = []
page = 1

while page <= 250:
    url = f"https://mydramalist.com/search?adv=titles&ty=68,83,86&co=3&st=3&so=top&page={page}"
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    page_cards = soup.select(".b-primary .text-muted")
    
    page_years = [find_year(y.text.strip()) for y in page_cards]
    kdrama_years = kdrama_years + page_years
    
    page += 1

print(kdrama_years)

[2020, 2021, 2021, 2022, 2020, 2019, 2022, 2021, 2019, 2023, ...]

In [371]:
kdrama_year_df = pd.DataFrame({
    'year': kdrama_years
    })
kdrama_year_df

Unnamed: 0,year
0,2020
1,2021
2,2021
3,2022
4,2020
...,...
4807,2023
4808,2018
4809,2013
4810,2021
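
Since `find_year` returns `None` when a card has no four-digit year, it can be worth checking how many entries are missing before charting (pandas `groupby` drops missing keys by default). Here is a small sketch using only the standard library; the sample list is made up.

```python
from collections import Counter

# Made-up sample; None stands for a card where find_year found no year
sample_years = [2020, 2021, 2021, None, 2022, 2021]

# Count titles per year, skipping the missing entries
counts = Counter(y for y in sample_years if y is not None)
per_year = sorted(counts.items())
missing = sample_years.count(None)

print(per_year)
print("missing years:", missing)
```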


In [382]:
kdramas_per_year = kdrama_year_df.groupby('year').size().reset_index(name='kdrama_count').drop([36])
kdrama_chart = alt.Chart(kdramas_per_year).mark_line().encode(
    alt.X('year:O', title='Year'),
    alt.Y('kdrama_count:Q', title='Number of K-dramas Aired in Year'),
    color=alt.value('#C54271')
)
kdrama_chart

In [384]:
kdrama_chart.save('kdrama_aired_per_year.html')
kdrama_chart.save('kdrama_aired_per_year.svg')
kdrama_chart.save('kdrama_aired_per_year.png')

### Netflix Rankings Data

Second, pulling in the Netflix country rankings data. This was just available to download as an Excel file on the Netflix official website.

In [289]:
df = pd.read_excel("netflix-rankings-by-country.xlsx")
shows_df = df[df['category'] != 'Films']
shows_df



Unnamed: 0,country_name,country_iso2,week,category,weekly_rank,show_title,season_title,cumulative_weeks_in_top_10
10,Argentina,AR,2024-06-23,TV,1,Bridgerton,Bridgerton: Season 3,6
11,Argentina,AR,2024-06-23,TV,2,Raising Voices,Raising Voices: Season 1,4
12,Argentina,AR,2024-06-23,TV,3,Gangs of Galicia,Gangs of Galicia: Season 1,1
13,Argentina,AR,2024-06-23,TV,4,Eric,Eric: Limited Series,4
14,Argentina,AR,2024-06-23,TV,5,Bridgerton,Bridgerton: Season 1,8
...,...,...,...,...,...,...,...,...
290855,Vietnam,VN,2021-07-04,TV,6,Reply 1988,Reply 1988: Season 1,1
290856,Vietnam,VN,2021-07-04,TV,7,"Nevertheless,","Nevertheless,: Limited Series",1
290857,Vietnam,VN,2021-07-04,TV,8,Too Hot to Handle,Too Hot to Handle: Season 2,1
290858,Vietnam,VN,2021-07-04,TV,9,Record of Ragnarok,Record of Ragnarok: Season 1,1


Next, I'm going to filter for all the rows where show_title is in the list of K-drama titles. A match means that particular K-drama was in the country's top 10 that week.

In [413]:
kdrama_df = shows_df[shows_df['show_title'].isin(kdrama_titles)]
kdrama_df

Unnamed: 0,country_name,country_iso2,week,category,weekly_rank,show_title,season_title,cumulative_weeks_in_top_10
34,Argentina,AR,2024-06-16,TV,5,Hierarchy,Hierarchy: Limited Series,1
98,Argentina,AR,2024-05-26,TV,9,The 8 Show,The 8 Show: Limited Series,1
139,Argentina,AR,2024-05-12,TV,10,Queen of Tears,Queen of Tears: Limited Series,3
179,Argentina,AR,2024-04-28,TV,10,Queen of Tears,Queen of Tears: Limited Series,2
199,Argentina,AR,2024-04-21,TV,10,Parasyte: The Grey,Parasyte: The Grey: Limited Series,3
...,...,...,...,...,...,...,...,...
290852,Vietnam,VN,2021-07-04,TV,3,Vincenzo,Vincenzo: Season 1,1
290854,Vietnam,VN,2021-07-04,TV,5,Hospital Playlist,Hospital Playlist: Season 2,1
290855,Vietnam,VN,2021-07-04,TV,6,Reply 1988,Reply 1988: Season 1,1
290856,Vietnam,VN,2021-07-04,TV,7,"Nevertheless,","Nevertheless,: Limited Series",1
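
The filter above relies on exact string matches, so a stray space or a capitalization difference between the two sources would silently drop a show. Below is a sketch of a more tolerant match, assuming that trimming whitespace and ignoring case is acceptable. The `normalize` helper and the sample data are made up, and this does nothing about different shows that share a name.

```python
import pandas as pd

def normalize(title):
    """Trim and case-fold a title so small formatting differences don't block a match."""
    return title.strip().casefold()

# Made-up sample data
titles = ["Vincenzo ", "reply 1988"]
shows = pd.DataFrame({"show_title": ["Vincenzo", "Reply 1988", "Bridgerton"]})

# Compare normalized titles on both sides, but keep the original rows
wanted = {normalize(t) for t in titles}
mask = shows["show_title"].map(normalize).isin(wanted)
matched = shows[mask]
print(matched["show_title"].tolist())
```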


Now to find the true Koreaboos. I'm simply going to group by country and count how many times a K-drama appeared in that country's weekly top 10 list.

In [426]:
koreaboo_df = kdrama_df.groupby('country_name').size().reset_index(name='count').sort_values('count', ascending=False).reset_index(drop=True).set_axis(['DW_NAME', 'VALUE'], axis=1)
print(koreaboo_df)

        DW_NAME  VALUE
0   South Korea   1000
1     Indonesia    868
2       Vietnam    856
3      Malaysia    769
4      Thailand    734
..          ...    ...
89      Iceland     30
90        Malta     29
91       Russia     26
92      Ireland     23
93      Ukraine     22

[94 rows x 2 columns]
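
Note that `size()` counts rows, so a week with two K-dramas in the top 10 counts twice. If the question is instead "in how many weeks did at least one K-drama chart?", the distinct weeks can be counted alongside. A small sketch on made-up data:

```python
import pandas as pd

# Made-up sample: two K-dramas chart in Vietnam in the same week
sample = pd.DataFrame({
    "country_name": ["Vietnam", "Vietnam", "Vietnam", "Ukraine"],
    "week": ["2021-07-04", "2021-07-04", "2021-07-11", "2021-07-04"],
    "show_title": ["Vincenzo", "Reply 1988", "Vincenzo", "Vincenzo"],
})

summary = sample.groupby("country_name").agg(
    appearances=("show_title", "count"),   # one per show per week
    distinct_weeks=("week", "nunique"),    # weeks with at least one K-drama
).reset_index()
print(summary)
```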


In [415]:
# Exporting so that I can use Datawrapper to put it onto a map
koreaboo_df.to_csv('koreaboos.csv')

In [416]:
merged_df = kdrama_df.merge(koreaboo_df, left_on='country_name', right_on='DW_NAME')
merged_df = merged_df.fillna(0)  # fillna returns a new DataFrame, so assign it back
merged_df.to_csv('merged_koreaboos.csv')

Now I want to know which shows were the most popular in each country.

In [417]:
# Group by country and show title, and find the show that has the max cumulative weeks in top 10
pop_show = kdrama_df.groupby(['country_name', 'show_title'])['cumulative_weeks_in_top_10'].max().reset_index()
pop_show_by_country = pop_show.sort_values('cumulative_weeks_in_top_10', ascending=False).groupby('country_name').first().reset_index()
pop_show_by_country

Unnamed: 0,country_name,show_title,cumulative_weeks_in_top_10
0,Argentina,Squid Game,11
1,Australia,Squid Game,11
2,Austria,Squid Game,12
3,Bahamas,Squid Game,9
4,Bahrain,Squid Game,16
...,...,...,...
89,United Kingdom,Squid Game,10
90,United States,Squid Game,11
91,Uruguay,Squid Game,9
92,Venezuela,Extraordinary Attorney Woo,11
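
An alternative to sorting and taking the first row per group is `idxmax`, which picks the row with the highest value within each country. This is a sketch on made-up data in the shape of `pop_show`; when two shows tie, `idxmax` keeps whichever comes first.

```python
import pandas as pd

# Made-up sample in the shape of pop_show
sample = pd.DataFrame({
    "country_name": ["Argentina", "Argentina", "Venezuela", "Venezuela"],
    "show_title": ["Squid Game", "The Glory", "Squid Game", "Extraordinary Attorney Woo"],
    "cumulative_weeks_in_top_10": [11, 5, 9, 11],
})

# Row label of the maximum within each country, then select those rows
top_idx = sample.groupby("country_name")["cumulative_weeks_in_top_10"].idxmax()
top_show = sample.loc[top_idx].reset_index(drop=True)
print(top_show)
```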


Which shows were the most popular overall?

In [418]:
# This is looking at how many countries had a particular K-drama for the most cumulative weeks
pop_show_by_country.groupby('show_title').size().reset_index(name='count').sort_values('count', ascending=False)

Unnamed: 0,show_title,count
9,Squid Game,65
12,True Beauty,6
0,Alchemy of Souls,5
4,Extraordinary Attorney Woo,5
10,The Glory,3
5,Hospital Playlist,2
1,Boys Over Flowers,1
2,Business Proposal,1
3,Crash Landing on You,1
6,My Demon,1


In [419]:
# This is more simply looking at how many times a particular K-drama shows up in the database
kdrama_df.groupby('show_title').size().reset_index(name='popularity score').sort_values('popularity score', ascending=False).head(10)

Unnamed: 0,show_title,popularity score
172,Squid Game,1233
10,Alchemy of Souls,623
11,All of Us Are Dead,622
192,The Glory,592
142,Physical: 100,535
58,Extraordinary Attorney Woo,518
114,My Demon,502
219,True Beauty,468
32,Business Proposal,458
88,King the Land,451


In [420]:
pop_show = kdrama_df.groupby(['country_name', 'show_title'])['cumulative_weeks_in_top_10'].max().reset_index()

In [424]:
# Or maybe what I want to say is "this show reached top 10 in this many countries, keeping its spot in the top 10 for an average of __ weeks."
popular_kdramas = pop_show.groupby('show_title').agg(
    num_countries=('country_name', 'count'),
    mean_weeks_in_top_10=('cumulative_weeks_in_top_10', 'mean')
).sort_values('num_countries', ascending=False).reset_index()
top_kdramas = popular_kdramas.head(5)
top_kdramas

Unnamed: 0,show_title,num_countries,mean_weeks_in_top_10
0,Squid Game,94,13.117021
1,All of Us Are Dead,94,6.617021
2,Hellbound,93,2.333333
3,The Glory,91,6.505495
4,Physical: 100,91,3.868132


In [425]:
top_kdramas.to_csv('top_kdramas.csv')

In [427]:
# Trying out a scatterplot to see what it looks like
alt.Chart(popular_kdramas).mark_circle(size=60).encode(
    alt.X('num_countries:O', title='Number of Countries with Show in Top 10'),
    alt.Y('mean_weeks_in_top_10:Q', title='Mean Weeks in Top 10'),
    tooltip=['show_title']).interactive()

Are K-dramas getting more or less popular with each passing year?

In [297]:
kdrama_df = kdrama_df.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
kdrama_df['week'] = pd.to_datetime(kdrama_df['week'])

In [298]:
kdrama_df['year'] = kdrama_df['week'].dt.year


In [299]:
kdrama_df

Unnamed: 0,country_name,country_iso2,week,category,weekly_rank,show_title,season_title,cumulative_weeks_in_top_10,year
34,Argentina,AR,2024-06-16,TV,5,Hierarchy,Hierarchy: Limited Series,1,2024
98,Argentina,AR,2024-05-26,TV,9,The 8 Show,The 8 Show: Limited Series,1,2024
139,Argentina,AR,2024-05-12,TV,10,Queen of Tears,Queen of Tears: Limited Series,3,2024
179,Argentina,AR,2024-04-28,TV,10,Queen of Tears,Queen of Tears: Limited Series,2,2024
199,Argentina,AR,2024-04-21,TV,10,Parasyte: The Grey,Parasyte: The Grey: Limited Series,3,2024
...,...,...,...,...,...,...,...,...,...
290852,Vietnam,VN,2021-07-04,TV,3,Vincenzo,Vincenzo: Season 1,1,2021
290854,Vietnam,VN,2021-07-04,TV,5,Hospital Playlist,Hospital Playlist: Season 2,1,2021
290855,Vietnam,VN,2021-07-04,TV,6,Reply 1988,Reply 1988: Season 1,1,2021
290856,Vietnam,VN,2021-07-04,TV,7,"Nevertheless,","Nevertheless,: Limited Series",1,2021


In [350]:
# On average, how often does a K-drama make the Netflix top 10 in a year in any given country?

by_country = kdrama_df.groupby(['year', 'country_name']).size().reset_index(name='kdrama_count')
by_year = by_country.groupby('year')['kdrama_count'].mean().reset_index()
by_year

Unnamed: 0,year,kdrama_count
0,2021,42.56383
1,2022,62.297872
2,2023,67.16129
3,2024,26.849462


In [363]:
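# 520 is presumably 52 weeks x 10 top-10 slots, so this is the share of a country's yearly TV top-10 slots held by K-dramas
by_year['kdrama_percent'] = by_year['kdrama_count']/520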
# Dropping 2024 because the year isn't over yet and doesn't show the full picture
by_year = by_year.drop([3])

In [364]:
by_year['year'] = by_year['year'].astype(int)
by_year['kdrama_percent'] = by_year['kdrama_percent'].astype(float)

In [367]:
alt.Chart(by_year).mark_line().encode(
    alt.X('year:O', title='Year'),
    alt.Y('kdrama_percent:Q', title='Percent of year K-drama is in top 10') 
).properties(width=500)
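
One thing to keep in mind about the fixed denominator of 520: the earliest week visible in the rankings output above is 2021-07-04, so 2021 likely doesn't cover a full year either, and dividing by 520 would understate it. A sketch of an alternative is to divide by the number of top 10 slots actually present in the data for each year. The sample data and the `is_kdrama` column are made up for illustration.

```python
import pandas as pd

# Made-up sample: two weeks of top-10 rows for one country, K-dramas flagged
rows = pd.DataFrame({
    "week": pd.to_datetime(["2021-07-04"] * 10 + ["2021-07-11"] * 10),
    "is_kdrama": [True] * 3 + [False] * 7 + [True] * 5 + [False] * 5,
})
rows["year"] = rows["week"].dt.year

per_year = rows.groupby("year").agg(
    kdrama_slots=("is_kdrama", "sum"),    # slots held by K-dramas
    total_slots=("is_kdrama", "count"),   # all slots present in the data
)
per_year["kdrama_share"] = per_year["kdrama_slots"] / per_year["total_slots"]
print(per_year)
```

With the real data, the same idea would be applied to `shows_df` per country and year, with `is_kdrama` taken from the `isin` filter.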