# Project Proposal

**Title**

An Analysis of Asian-American Artists in the Top Music Charts

**Problem Statement**

Western culture has historically dominated popular culture, but the world has become increasingly globalized. We would like to know how popular artists of mixed ethnicity are becoming in the global music scene by observing a subset of Asian-American artists.

**Motivation**

Inclusivity is a rising sentiment in the world today. While movements supporting African American and Latin American communities have gained support and led to significant changes, other groups seem to be overlooked. In recent years, however, people have become more vocal about Asian inclusion, and Asian representation has grown in movies such as “Crazy Rich Asians” and TV series such as “Kim’s Convenience”. Given the rising popularity of Asian bands like BTS, the group wants to find out whether the same is true in the music scene and to quantify the participation of Asian artists in it.

**Datasets**

* Wikipedia: 

    * Pages in Category: American musicians of Asian descent and its subcategories per Asian nationality

* Billboard weekly chart results for the past 10 years (2011 to 2021)

* Spotify

**Methodology**

* Obtain a list of American musicians of Asian descent from: https://en.wikipedia.org/wiki/Category:American_musicians_of_Asian_descent

* Scrape through the Billboard top artists for at least the past decade (2011 - 2021) to obtain the 100 top artists for each week. 

* Do the same for the Spotify API using at minimum the following playlists:

    * Top 50 Global

    * Today’s Top Hits

* Analyze the performance of the artists from the Wikipedia list in the Billboard 100 data and Spotify data.

* Measure an artist’s performance as the number of times the artist’s songs have appeared on the Top 100 Billboard Chart, Spotify’s Top 50 Global, and Today’s Top Hits.

* Observe the trend in the total number of appearances across all Asian-American artists over the 10-year period from 2011 to 2021.
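
The performance measure described above can be sketched as a simple count. This is a minimal illustration, not the analysis code itself: it assumes chart data shaped like the dictionaries scraped later in this notebook (date string mapped to a list of `(song, artist)` tuples), and the function name `yearly_appearances`, the case-insensitive exact-name matching, and the toy data are all assumptions made for the example.

```python
from collections import Counter


def yearly_appearances(weekly_chart, artists_of_interest):
    """Count chart entries per year credited to any artist of interest.

    weekly_chart: dict mapping 'YYYY-MM-DD' -> list of (song, artist) tuples.
    artists_of_interest: iterable of artist names (matched case-insensitively,
    which is an assumption of this sketch).
    """
    wanted = {name.casefold() for name in artists_of_interest}
    counts = Counter()
    for date, entries in weekly_chart.items():
        year = int(date[:4])
        for _song, artist in entries:
            if artist.casefold() in wanted:
                counts[year] += 1
    return dict(sorted(counts.items()))


# Toy data, invented purely for illustration.
sample_chart = {
    '2020-01-06': [('Song A', 'Artist X'), ('Song B', 'Artist Y')],
    '2020-01-13': [('Song A', 'Artist X')],
    '2021-01-04': [('Song C', 'artist x'), ('Song D', 'Artist Z')],
}
print(yearly_appearances(sample_chart, ['Artist X']))  # {2020: 2, 2021: 1}
```

Exact-name matching would miss entries that credit several artists in one string, so a real analysis would likely need to split or normalize artist credits first.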



# Importing Packages and Specifying Proxies

In [None]:
import os
import re
import time
import json
import pickle
import requests
import numpy as np
import pandas as pd
from tqdm import tqdm
from datetime import datetime, timedelta
from bs4 import BeautifulSoup

In [None]:
proxies = {
  'http': 'http://206.189.157.23',
  'https': 'http://206.189.157.23',
}

In [None]:
# Create the output directory if it does not already exist
current_wd = os.getcwd()  # Path of current working directory
os.makedirs('{}/pre_processed_data'.format(current_wd), exist_ok=True)
# os.mkdir('{}/post_processed_data'.format(os.path.dirname(current_wd)))

# Wikipedia Artists/Musicians Web Scraping

## Asian-American Scraping

In [None]:
param = {
    'action': 'query',
    'list': 'categorymembers',
    'cmtitle': 'Category:American musicians of Asian descent',
    'cmtype': 'subcat',
    'cmlimit': '500',  # 500 is the API maximum for regular (non-bot) clients
    'format': 'xml',
}
req = requests.get("http://en.wikipedia.org/w/api.php", params=param)
soup = BeautifulSoup(req.text)
subcateg = [cm['title'] for cm in soup.select('cm') if
            'Asian descent' not in cm['title']]

artists_in_country = {}
for sc in subcateg:
    param = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': f'{sc}',
        'cmtype': 'page',
        'cmlimit': '500',
        'format': 'xml',
    }
    req = requests.get("http://en.wikipedia.org/w/api.php", params=param)
    time.sleep(2)
    soup = BeautifulSoup(req.text)
    pages = [cm['title'] for cm in soup.select('cm')]
    country = re.findall(r'(\b\w+\b)(?= descent)', sc)[0]
    artists_in_country[country] = pages

# Save the scraped data (overwrites any existing file).
filename = "./pre_processed_data/wiki_AsianAmerican_musicians.pkl"
with open(filename, "wb") as file:
    pickle.dump(artists_in_country, file)
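
A single `categorymembers` request returns at most `cmlimit` entries, so large categories need follow-up requests. The sketch below shows one way to follow the continuation tokens that the MediaWiki API returns in a `continue` object. It assumes the JSON response format rather than the XML used above, and `fetch_all_members`, `fetch_json`, and `fake_fetch` are hypothetical names introduced only for this example; the fake responses are invented.

```python
def fetch_all_members(category, fetch_json, cmtype='page'):
    """Collect every member title of a category, following continuation.

    fetch_json is a caller-supplied function (hypothetical here) that takes
    a dict of query parameters and returns the decoded JSON response.
    """
    params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': category,
        'cmtype': cmtype,
        'cmlimit': '500',
        'format': 'json',
    }
    titles = []
    while True:
        data = fetch_json(params)
        titles += [m['title'] for m in data['query']['categorymembers']]
        if 'continue' not in data:
            return titles
        # Merge the continuation tokens into the next request.
        params = {**params, **data['continue']}


def fake_fetch(params):
    """Stand-in for the network call: two pages of invented results."""
    if 'cmcontinue' not in params:
        return {'query': {'categorymembers': [{'title': 'A'}, {'title': 'B'}]},
                'continue': {'cmcontinue': 'page|C', 'continue': '-||'}}
    return {'query': {'categorymembers': [{'title': 'C'}]}}


print(fetch_all_members('Category:Example', fake_fetch))  # ['A', 'B', 'C']
```

With the real API, `fetch_json` could be something like `lambda p: requests.get("https://en.wikipedia.org/w/api.php", params=p).json()`.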

## East and Southeast Asian Scraping

In [None]:
asian_people = ['South Korean', 'Chinese', 'Hong Kong',
                'Japanese', 'Mongolian', 'Filipino',
                'Taiwanese', 'Bruneian', 'Cambodian',
                'East Timorese', 'Indonesian', 'Laotian',
                'Malaysian', 'Burmese', 'Singaporean',
                'Thai', 'Vietnamese']
remove_words = ['(group)', '(band)', '(musician)', '(music)', '(composer)',
                '(singer)', '(artist)']
remove_words = remove_words + \
    ['('+a + ' band)' for a in asian_people] + \
    ['('+a + ' singer)' for a in asian_people]
categories_not_to_scrape = ['songwriters', 'composers',
                            'by',  # Except `by genre`
                            'Wikipedia categories', 'conductors',
                            'List of awards', 'dynasties',
                            'dynasty musicians', 'Kingdoms', 'musician stubs',
                            'Albums', 'albums', 'EPs', 'concert tours',
                            'concerts', 'Songs', 'songs', 'discography']

In [None]:
def crawl_wiki(subcateg):
    """Recursively crawl a Wikipedia Category and retrieve all pages and
    band/group categories into a list.

    Parameters
    -----------
    subcateg : str
        Subcategory Webpage with the format "Category:{text} musicians".

    Returns
    --------
    pages : list of strings
        List of pages or group/member categories of musicians inside
        a Wikipedia Category.
    """

    params = {
        "action": "query",
        "format": "xml",
        "list": "categorymembers",
        "cmtitle": "",
        "cmtype": "subcat|page",
        "cmlimit": "500"
    }
    params['cmtitle'] = subcateg
    req = requests.get("http://en.wikipedia.org/w/api.php",
                       params=params,
                       proxies=proxies)
    time.sleep(2)
    soup = BeautifulSoup(req.text)

    # Get Tags for Subcategories and Pages
    subcategs_links = soup.select('cm[ns="14"]')
    pages_links = soup.select('cm[ns="0"]')

    # Extract information from tags
    subcategs = [s['title'] for s in subcategs_links]
    pages = [p['title'] for p in pages_links]

    # Remove redundant and out-of-scope subcategories. Append band and group
    # subcategories to pages list.
    def retain_categ(x):
        # Always keep "by genre" categories; otherwise drop any category
        # whose title contains an excluded word.
        if 'by genre' in x:
            return True
        return all(word not in x for word in categories_not_to_scrape)

    subcategs = [s for s in subcategs if retain_categ(s)]
    bands_subcategs = [x for x in subcategs if 'members' in x]

    pages = pages + bands_subcategs
    for s in subcategs:
        pages = pages + crawl_wiki(s)

    # Strip every disambiguation suffix from each title, then deduplicate.
    cleaned = set()
    for page in pages:
        for r in remove_words:
            page = page.replace(r, '')
        cleaned.add(page.strip())
    return list(cleaned)

In [None]:
for nation in tqdm(asian_people):
    # One dictionary (and one pickle file) per nationality.
    asian_dictionaries = {nation: crawl_wiki(f"Category:{nation} musicians")}

    # Save the scraped data (overwrites any existing file).
    filename = f"./pre_processed_data/wiki_{nation}_musicians.pkl"
    with open(filename, "wb") as file:
        pickle.dump(asian_dictionaries, file)
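
Category graphs can contain loops or share subcategories, in which case a plain recursive crawl may visit the same category repeatedly. The sketch below shows a breadth-first alternative that tracks visited categories. It is an illustration under assumptions: `crawl_category_tree` and `get_members` are hypothetical names, and the toy graph is invented; in this notebook the role of `get_members` would be played by the API request inside `crawl_wiki`.

```python
from collections import deque


def crawl_category_tree(root, get_members):
    """Breadth-first walk over a category graph, visiting each category once.

    get_members is a hypothetical callable returning (subcategories, pages)
    for a given category title.
    """
    seen = {root}
    queue = deque([root])
    pages = set()
    while queue:
        category = queue.popleft()
        subcats, members = get_members(category)
        pages.update(members)
        for sub in subcats:
            if sub not in seen:  # skip categories already scheduled
                seen.add(sub)
                queue.append(sub)
    return sorted(pages)


# Invented toy graph with a cycle (C links back to A).
toy_graph = {
    'Category:A': (['Category:B', 'Category:C'], ['Page 1']),
    'Category:B': (['Category:C'], ['Page 2']),
    'Category:C': (['Category:A'], ['Page 2', 'Page 3']),
}
print(crawl_category_tree('Category:A', toy_graph.__getitem__))
# ['Page 1', 'Page 2', 'Page 3']
```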

# Billboard Scraping

## Notes

* Weekly
    * Records are every Monday
        * The Hot 100 Chart - Start August 4, 1958; End August 2, 2021
        * Billboard 200 Chart - Start August 17, 1963; End August 2, 2021
        * Billboard Global 200 Chart - Start September 19, 2020; End August 2, 2021
        * Billboard Global 200 Excl US - Start September 19, 2020; End August 2, 2021
        * Artist 100 Chart - Start July 19, 2014; End August 2, 2021
* Yearly
    * Hot 100 - Starts in 2004
    * Billboard 200 - Starts in 2002
* Decade
    * Hot 100 - Only available for 2010s
    * Billboard 200 - Only available for 2010s
    * Top Artists - Only available for 2010s
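
Because the weekly records fall on Mondays, the scraping cells below request one URL per Monday. As a rough standard-library illustration of what a weekly Monday range contains, the sketch below lists every Monday between two dates; `mondays_between` is a hypothetical helper written for this example, and the notebook itself uses `pd.date_range` with `freq='W-MON'` instead.

```python
from datetime import date, timedelta


def mondays_between(start, end):
    """Return ISO date strings for every Monday in [start, end]."""
    # weekday() is 0 for Monday; move to the first Monday on or after start.
    current = start + timedelta(days=(7 - start.weekday()) % 7)
    days = []
    while current <= end:
        days.append(current.isoformat())
        current += timedelta(weeks=1)
    return days


dates = mondays_between(date(2011, 1, 1), date(2021, 8, 1))
print(dates[0], dates[-1], len(dates))  # 2011-01-03 2021-07-26 552
```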

## Weekly Data Scraping

### The Hot 100 Chart

In [None]:
weekly_hot_100 = {}
dates = pd.date_range(start=str('2011-01-01'),
                      end=str('2021-08-01'),
                      freq='W-MON').strftime('%Y-%m-%d').tolist()
for d in tqdm(dates):
    req = requests.get(f'https://www.billboard.com/charts/hot-100/{d}',
                       proxies=proxies)
    time.sleep(5)
    soup = BeautifulSoup(req.text)
    top100 = soup.select('div[class="chart-list container"]')[0]
    song_lines = top100.select(
        'span.chart-element__information > '
        'span.chart-element__information__song.'
        'text--truncate.color--primary')
    singer_lines = top100.select(
        'span.chart-element__information > '
        'span.chart-element__information__artist.'
        'text--truncate.color--secondary')
    song_singer = [(s.text, a.text)
                   for s, a in zip(song_lines, singer_lines)]
    weekly_hot_100[d] = song_singer

# Save the scraped data (overwrites any existing file).
filename = "./pre_processed_data/weekly_hot_100.pkl"
with open(filename, "wb") as file:
    pickle.dump(weekly_hot_100, file)

### Billboard 200 Chart

In [None]:
weekly_billboard_200 = {}
dates = pd.date_range(start=str('2011-01-01'),
                      end=str('2021-02-01'),
                      freq='W-MON').strftime('%Y-%m-%d').tolist()
for d in tqdm(dates):
    req = requests.get(f'https://www.billboard.com/charts/billboard-200/{d}',
                       proxies=proxies)
    time.sleep(5)
    soup = BeautifulSoup(req.text)
    billboard200 = soup.select('div[class="chart-list container"]')[0]
    song_lines = billboard200.select(
        'span.chart-element__information > '
        'span.chart-element__information__song.'
        'text--truncate.color--primary')
    singer_lines = billboard200.select(
        'span.chart-element__information > '
        'span.chart-element__information__artist.'
        'text--truncate.color--secondary')
    song_singer = [(s.text, a.text) for s, a in zip(song_lines, singer_lines)]
    weekly_billboard_200[d] = song_singer

# Save the scraped data (overwrites any existing file).
filename = "./pre_processed_data/weekly_billboard_200.pkl"
with open(filename, "wb") as file:
    pickle.dump(weekly_billboard_200, file)

### Billboard Global 200

In [None]:
weekly_billboard_global_200 = {}
dates = pd.date_range(start=str('2020-09-19'),
                      end=str('2021-02-01'),
                      freq='W-MON').strftime('%Y-%m-%d').tolist()
for d in tqdm(dates):
    req = requests.get(
        f'https://www.billboard.com/charts/billboard-global-200/{d}',
        proxies=proxies)
    time.sleep(2)
    soup = BeautifulSoup(req.text)
    artist = soup.select('div[class="chart-list-item__artist"]')
    song = soup.select('span[class="chart-list-item__title-text"]')
    song_singer = [(s.text.strip(), a.text.strip())
                   for a, s in zip(artist, song)]
    weekly_billboard_global_200[d] = song_singer

# Save the scraped data (overwrites any existing file).
filename = "./pre_processed_data/weekly_billboard_global_200.pkl"
with open(filename, "wb") as file:
    pickle.dump(weekly_billboard_global_200, file)

100%|██████████| 20/20 [02:41<00:00,  8.08s/it]


### Billboard Global 200 Excl US

In [None]:
weekly_billboard_global_exclUS_200 = {}
dates = pd.date_range(start=str('2020-09-19'),
                      end=str('2021-02-01'),
                      freq='W-MON').strftime('%Y-%m-%d').tolist()
for d in tqdm(dates):
    req = requests.get(
        f'https://www.billboard.com/charts/billboard-global-excl-us/{d}',
        proxies=proxies)
    time.sleep(5)
    soup = BeautifulSoup(req.text)
    artist = soup.select('div[class="chart-list-item__artist"]')
    song = soup.select('span[class="chart-list-item__title-text"]')
    song_singer = [(s.text.strip(), a.text.strip())
                   for a, s in zip(artist, song)]
    weekly_billboard_global_exclUS_200[d] = song_singer

# Save the scraped data (overwrites any existing file).
filename = "./pre_processed_data/weekly_billboard_global_exclUS_200.pkl"
with open(filename, "wb") as file:
    pickle.dump(weekly_billboard_global_exclUS_200, file)

### Top 100 Artists

In [None]:
weekly_top_artist_100 = {}
dates = pd.date_range(start=str('2014-07-19'),
                      end=str('2021-02-01'),
                      freq='W-MON').strftime('%Y-%m-%d').tolist()
for d in tqdm(dates):
    req = requests.get(f'https://www.billboard.com/charts/artist-100/{d}',
                       proxies=proxies)
    time.sleep(2)
    soup = BeautifulSoup(req.text)
    if req.status_code != 200:
        print(req, d)
    artist = soup.select('span[class="chart-list-item__title-text"]')
    singer = [a.text.strip() for a in artist]
    weekly_top_artist_100[d] = singer
    
# Save the scraped data (overwrites any existing file).
filename = "./pre_processed_data/weekly_top_artist_100.pkl"
with open(filename, "wb") as file:
    pickle.dump(weekly_top_artist_100, file)

100%|██████████| 342/342 [30:54<00:00,  5.42s/it]


# Spotify Scraping

## Notes

* Top 200
    * Weekly
        * Links - Format https://spotifycharts.com/regional/global/weekly/2021-07-30--2021-08-06
            * Every 7 days after 2016 12 29
            * %Y-%m-%d--%Y-%m-%d
            * Start 2016 12 29 ; End 2021 08 05
        * Global
        * US
    * Daily
        * Links - Format https://spotifycharts.com/regional/global/daily/2017-01-01
            * Every day after 2017 01 01
            * %Y-%m-%d
            * Start 2017 01 01 ; End 2021 08 05 
        * Global
        * US
* Viral 50
    * Weekly
        * Links - Format https://spotifycharts.com/viral/global/weekly/2017-01-05--2017-01-05
            * Every 7 days after 2017 01 05
            * %Y-%m-%d--%Y-%m-%d
            * Start 2017 01 05 ; End 2021 08 05
        * Global
        * US
    * Daily
        * Links - Format https://spotifycharts.com/viral/global/daily/2017-01-01
            * Every day after 2017 01 01
            * %Y-%m-%d
            * Start 2017 01 01 ; End 2021 08 05 
        * Global
        * US
* https://towardsdatascience.com/billboard-hot-100-analytics-using-data-to-understand-the-shift-in-popular-music-in-the-last-60-ac3919d39b49

In [None]:
top200_weekly_dates = pd.date_range(
    start='2016-12-23', end='2021-08-05',
    freq='D').strftime('%Y-%m-%d').tolist()[::7]
top200_daily_dates = pd.date_range(
    start='2017-01-01', end='2021-08-05',
    freq='D').strftime('%Y-%m-%d').tolist()
viral50_weekly_dates = pd.date_range(
    start='2017-01-05', end='2021-08-05',
    freq='D').strftime('%Y-%m-%d').tolist()[::7]
viral50_daily_dates = pd.date_range(
    start='2017-01-05', end='2021-08-05',
    freq='D').strftime('%Y-%m-%d').tolist()

In [None]:
top200_weekly_dates = [(top200_weekly_dates[i], top200_weekly_dates[i+1])
                         for i in range(len(top200_weekly_dates)-1)]
viral50_weekly_dates = [(viral50_weekly_dates[i], viral50_weekly_dates[i+1])
                         for i in range(len(viral50_weekly_dates)-1)]
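
The pairing step above turns a list of dates seven days apart into `(start, end)` windows that slot into the weekly URL format. A standard-library sketch of the same idea follows; `weekly_windows` is a hypothetical helper introduced for illustration, not part of the notebook.

```python
from datetime import date, timedelta


def weekly_windows(start, end):
    """Return (week_start, week_end) ISO string pairs spaced 7 days apart."""
    days = []
    current = start
    while current <= end:
        days.append(current.isoformat())
        current += timedelta(days=7)
    # Pair each date with the next one: (d0, d1), (d1, d2), ...
    return list(zip(days, days[1:]))


windows = weekly_windows(date(2016, 12, 23), date(2021, 8, 5))
base = 'https://spotifycharts.com/regional/global/weekly/'
print(len(windows))  # 240
print(f'{base}{windows[0][0]}--{windows[0][1]}')
```

Using `zip(days, days[1:])` avoids the index arithmetic and naturally drops the final unpaired date.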

### Top 200 Weekly Global

In [None]:
top200_weekly_global = {}

for d in tqdm(top200_weekly_dates):
    header = {'user-agent':
              'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0)'
              ' Gecko/20100101 Firefox/90.0'}
    spotify_url = 'https://spotifycharts.com/regional/global/weekly/'
    req = requests.get(f'{spotify_url}{d[0]}--{d[1]}',
                       proxies=proxies,
                       headers=header)
    time.sleep(2)
    if req.status_code != 200:
        print(req, f'{spotify_url}{d[0]}--{d[1]}')
    soup = BeautifulSoup(req.text)
    tracks = soup.select('td[class="chart-table-track"]')
    song_singer = [(t.strong.text, t.span.text) for t in tracks]
    stream = soup.select('td[class="chart-table-streams"]')
    stream_count = [s.text for s in stream]
    # Store each week once, keyed by its start date.
    top200_weekly_global[d[0]] = (song_singer, stream_count)

# Save the scraped data (overwrites any existing file).
filename = "./pre_processed_data/spotify_top200_weekly_global.pkl"
with open(filename, "wb") as file:
    pickle.dump(top200_weekly_global, file)

 10%|▉         | 23/240 [01:34<14:14,  3.94s/it]

<Response [404]> https://spotifycharts.com/regional/global/weekly/2017-05-26--2017-06-02


 10%|█         | 24/240 [01:38<13:52,  3.85s/it]

<Response [404]> https://spotifycharts.com/regional/global/weekly/2017-06-02--2017-06-09


100%|██████████| 240/240 [17:10<00:00,  4.29s/it]
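
The run above printed two 404 responses, and the loop still stores an empty result for those weeks. One possible way to keep failed weeks separate, so they can be retried or excluded later, is sketched below. `collect_charts`, `fetch`, and `parse` are hypothetical names standing in for the `requests` and `BeautifulSoup` steps, and the toy responses are invented.

```python
def collect_charts(keys, fetch, parse):
    """Fetch and parse one chart per key, recording failures separately.

    fetch(key) -> (status_code, text) and parse(text) -> list are
    hypothetical callables supplied by the caller.
    """
    charts, failed = {}, []
    for key in keys:
        status, text = fetch(key)
        if status != 200:
            failed.append((key, status))  # keep the key for a later retry
            continue
        charts[key] = parse(text)
    return charts, failed


# Invented responses: one success and two missing weeks.
responses = {
    '2017-05-19': (200, 'a,b'),
    '2017-05-26': (404, ''),
    '2017-06-02': (404, ''),
}
charts, failed = collect_charts(responses, responses.get,
                                lambda text: text.split(','))
print(charts)
print(failed)
```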


### Top 200 Weekly US

In [None]:
top200_weekly_us = {}

for d in tqdm(top200_weekly_dates):
    header = {'user-agent':
              'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0)'
              ' Gecko/20100101 Firefox/90.0'}
    spotify_url = 'https://spotifycharts.com/regional/us/weekly/'
    req = requests.get(f'{spotify_url}{d[0]}--{d[1]}',
                       proxies=proxies,
                       headers=header)
    time.sleep(2)
    if req.status_code != 200:
        print(req, f'{spotify_url}{d[0]}--{d[1]}')
    soup = BeautifulSoup(req.text)
    tracks = soup.select('td[class="chart-table-track"]')
    song_singer = [(t.strong.text, t.span.text) for t in tracks]
    stream = soup.select('td[class="chart-table-streams"]')
    stream_count = [s.text for s in stream]
    # Store each week once, keyed by its start date.
    top200_weekly_us[d[0]] = (song_singer, stream_count)

# Save the scraped data (overwrites any existing file).
filename = "./pre_processed_data/spotify_top200_weekly_us.pkl"
with open(filename, "wb") as file:
    pickle.dump(top200_weekly_us, file)

100%|██████████| 240/240 [17:57<00:00,  4.49s/it]


### Top 200 Daily Global

In [None]:
top200_daily_global = {}

for d in tqdm(top200_daily_dates):
    header = {'user-agent':
              'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0)'
              ' Gecko/20100101 Firefox/90.0'}
    spotify_url = 'https://spotifycharts.com/regional/global/daily/'
    req = requests.get(f'{spotify_url}{d}',
                       proxies=proxies,
                       headers=header)
    time.sleep(2)
    if req.status_code != 200:
        print(req, f'{spotify_url}{d}')
    soup = BeautifulSoup(req.text)
    tracks = soup.select('td[class="chart-table-track"]')
    song_singer = [(t.strong.text, t.span.text) for t in tracks]
    top200_daily_global[d] = song_singer

# Save the scraped data (overwrites any existing file).
filename = "./pre_processed_data/spotify_top200_daily_global.pkl"
with open(filename, "wb") as file:
    pickle.dump(top200_daily_global, file)

### Top 200 Daily US

In [None]:
top200_daily_us = {}

for d in tqdm(top200_daily_dates):
    header = {'user-agent':
              'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0)'
              ' Gecko/20100101 Firefox/90.0'}
    spotify_url = 'https://spotifycharts.com/regional/us/daily/'
    req = requests.get(f'{spotify_url}{d}',
                       proxies=proxies,
                       headers=header)
    time.sleep(2)
    if req.status_code != 200:
        print(req, f'{spotify_url}{d}')
    soup = BeautifulSoup(req.text)
    tracks = soup.select('td[class="chart-table-track"]')
    streams = soup.select('td[class="chart-table-streams"]')
    song_singer = [(t.strong.text, t.span.text, z.text)
                   for t, z in zip(tracks, streams)]
    top200_daily_us[d] = song_singer

# Save the scraped data (overwrites any existing file).
filename = "./pre_processed_data/spotify_top200_daily_us_v2.pkl"
with open(filename, "wb") as file:
    pickle.dump(top200_daily_us, file)

  0%|          | 0/1678 [00:00<?, ?it/s]

[('Bad and Boujee (feat. Lil Uzi Vert)', 'by Migos', '1,371,493'),
 ('Fake Love', 'by Drake', '1,180,074'),
 ('Starboy', 'by The Weeknd, Daft Punk', '1,064,351'),
 ('Closer', 'by The Chainsmokers, Halsey', '1,010,492'),
 ('Black Beatles', 'by Rae Sremmurd, Gucci Mane', '874,289'),
 ('Broccoli (feat. Lil Yachty)', 'by Shelley FKA DRAM', '763,259'),
 ('One Dance', 'by Drake, WizKid, Kyla', '753,150'),
 ('Caroline', 'by Aminé', '714,839'),
 ('Let Me Love You', 'by DJ Snake, Justin Bieber', '690,483'),
 ('Bounce Back', 'by Big Sean', '682,688'),
 ('I Feel It Coming', 'by The Weeknd, Daft Punk', '651,807'),
 ('24K Magic', 'by Bruno Mars', '574,974'),
 ('Bad Things (with Camila Cabello)', 'by Machine Gun Kelly', '567,789'),
 ('X (feat. Future)', 'by 21 Savage, Metro Boomin', '544,620'),
 ('I Don’t Wanna Live Forever (Fifty Shades Darker) - From "Fifty Shades Darker (Original Motion Picture Soundtrack)"',
  'by ZAYN, Taylor Swift',
  '507,450'),
 ("Don't Wanna Know", 'by Maroon 5, Kendrick La

  0%|          | 0/1678 [00:04<?, ?it/s]


NameError: name 'top200_daily_us' is not defined
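
The sample records printed above show that the scraped artist field carries a leading "by " and that stream counts are comma-formatted strings. A small cleaning step, sketched here under assumptions, could normalize both before analysis. `clean_record` is a hypothetical helper, and splitting the byline on commas is a simplification that would mis-split any artist name that itself contains a comma.

```python
def clean_record(record):
    """Normalize a (song, 'by Artist, Artist', '1,234') tuple."""
    song, byline, streams = record
    # Assumes the byline starts with "by "; otherwise keep it unchanged.
    names = byline[3:] if byline.startswith('by ') else byline
    # Simplification: treat every comma as a separator between artists.
    artists = [name.strip() for name in names.split(',')]
    return song, artists, int(streams.replace(',', ''))


print(clean_record(('Starboy', 'by The Weeknd, Daft Punk', '1,064,351')))
# ('Starboy', ['The Weeknd', 'Daft Punk'], 1064351)
```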

### Viral 50 Weekly Global

In [None]:
viral50_weekly_global = {}

for d in tqdm(viral50_weekly_dates):
    header = {'user-agent':
              'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0)'
              ' Gecko/20100101 Firefox/90.0'}
    spotify_url = 'https://spotifycharts.com/viral/global/weekly/'
    req = requests.get(f'{spotify_url}{d[0]}--{d[0]}',
                       proxies=proxies,
                       headers=header)
    time.sleep(2)
    if req.status_code != 200:
        print(req, f'{spotify_url}{d[0]}--{d[0]}')
    soup = BeautifulSoup(req.text)
    tracks = soup.select('td[class="chart-table-track"]')
    song_singer = [(t.strong.text, t.span.text) for t in tracks]
    viral50_weekly_global[d[0]] = song_singer

# Save the scraped data (overwrites any existing file).
filename = "./pre_processed_data/spotify_viral50_weekly_global.pkl"
with open(filename, "wb") as file:
    pickle.dump(viral50_weekly_global, file)


### Viral 50 Weekly US

In [None]:
viral50_weekly_us = {}

for d in tqdm(viral50_weekly_dates):
    header = {'user-agent':
              'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0)'
              ' Gecko/20100101 Firefox/90.0'}
    spotify_url = 'https://spotifycharts.com/viral/us/weekly/'
    req = requests.get(f'{spotify_url}{d[0]}--{d[0]}',
                       proxies=proxies,
                       headers=header)
    time.sleep(2)
    if req.status_code != 200:
        print(req, f'{spotify_url}{d[0]}--{d[0]}')
    soup = BeautifulSoup(req.text)
    tracks = soup.select('td[class="chart-table-track"]')
    song_singer = [(t.strong.text, t.span.text) for t in tracks]
    viral50_weekly_us[d[0]] = song_singer

# Save the scraped data (overwrites any existing file).
filename = "./pre_processed_data/spotify_viral50_weekly_us.pkl"
with open(filename, "wb") as file:
    pickle.dump(viral50_weekly_us, file)


### Viral 50 Daily Global

In [None]:
viral50_daily_global = {}

for d in tqdm(viral50_daily_dates):
    header = {'user-agent':
              'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0)'
              ' Gecko/20100101 Firefox/90.0'}
    spotify_url = 'https://spotifycharts.com/viral/global/daily/'
    req = requests.get(f'{spotify_url}{d}',
                       proxies=proxies,
                       headers=header)
    time.sleep(2)
    if req.status_code != 200:
        print(req, f'{spotify_url}{d}')
    soup = BeautifulSoup(req.text)
    tracks = soup.select('td[class="chart-table-track"]')
    song_singer = [(t.strong.text, t.span.text) for t in tracks]
    viral50_daily_global[d] = song_singer

# Save the scraped data (overwrites any existing file).
filename = "./pre_processed_data/spotify_viral50_daily_global.pkl"
with open(filename, "wb") as file:
    pickle.dump(viral50_daily_global, file)

### Viral 50 Daily US

In [None]:
viral50_daily_US = {}

for d in tqdm(viral50_daily_dates):
    header = {'user-agent':
              'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0)'
              ' Gecko/20100101 Firefox/90.0'}
    spotify_url = 'https://spotifycharts.com/viral/us/daily/'
    req = requests.get(f'{spotify_url}{d}',
                       proxies=proxies,
                       headers=header)
    time.sleep(2)
    if req.status_code != 200:
        print(req, f'{spotify_url}{d}')
    soup = BeautifulSoup(req.text)
    tracks = soup.select('td[class="chart-table-track"]')
    song_singer = [(t.strong.text, t.span.text) for t in tracks]
    viral50_daily_US[d] = song_singer

# Save the scraped data (overwrites any existing file).
filename = "./pre_processed_data/spotify_viral50_daily_us.pkl"
with open(filename, "wb") as file:
    pickle.dump(viral50_daily_US, file)

# Wikipedia Artists DataFrame Creation

In [None]:
# Create one DataFrame covering all Wikipedia artists.
data_dir = './pre_processed_data'  # where the scraping cells saved the pickles


def load_artists(filename, lineage, country_suffix=''):
    """Load a pickled {country: [artists]} dict into a DataFrame with
    `artist`, `country`, and `lineage` columns."""
    pkl = pd.read_pickle(os.path.join(data_dir, filename))
    rows = [(artist, country + country_suffix)
            for country, artists in pkl.items()
            for artist in set(artists)]
    dfc = pd.DataFrame(rows, columns=['artist', 'country'])
    dfc['lineage'] = lineage
    return dfc


# One pickle file per nationality, loaded in a fixed order.
nations = ['Bruneian', 'Burmese', 'Cambodian', 'Chinese', 'East Timorese',
           'Filipino', 'Hong Kong', 'Indonesian', 'Japanese', 'Laotian',
           'Malaysian', 'Mongolian', 'Singaporean', 'South Korean',
           'Taiwanese', 'Thai', 'Vietnamese']
frames = [load_artists(f'wiki_{nation}_musicians.pkl', 'Asian')
          for nation in nations]

# Asian-American musicians get a "<country> American" label and the
# 'Mixed Asian' lineage.
frames.append(load_artists('wiki_AsianAmerican_musicians.pkl',
                           'Mixed Asian', country_suffix=' American'))
df_wik = pd.concat(frames)

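# clean
def remove_parenthesis(x):
    """Drop a parenthesized suffix from a string and strip extra whitespace."""
    match = re.search(r'(.*)\s?\(.*\)', x)
    # Fall back to the stripped input when there is no closing parenthesis.
    return match.group(1).strip() if match else x.strip()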

df_wik.drop_duplicates(subset='artist', keep='first', inplace=True)
df_wik['artist'] = df_wik['artist'].apply(lambda x: remove_parenthesis(x)
                                          if r'(' in x else x)
df_wik.to_csv('./post_processed_data/df_wik.csv')

# SQLite DB creation and storage

## Notes

Scraped files to be stored in the following tables:

1.   Table: **wiki_artists**

2.   Table: **billboard_artists**

3.   Table: **billboard**

4.   Table: **spotify**

In [None]:
db = './pre_processed_data/lab2.db'

In [None]:
# """ create a database connection to a SQLite database """
# One-time setup: uncomment to create the database file and its tables.
# conn = sqlite3.connect(db)
# print(sqlite3.sqlite_version)

# conn.execute("""
# CREATE TABLE IF NOT EXISTS spotify (
#     chartname TEXT NOT NULL,
#     chartfreq TEXT NOT NULL,
#     date TEXT NOT NULL,
#     region TEXT NOT NULL,
#     rank INT NOT NULL,
#     song TEXT NOT NULL,
#     artist TEXT NOT NULL,
#     artist2 TEXT NOT NULL,
#     stream_count INT
# )
# """)

# conn.execute("""
# CREATE TABLE IF NOT EXISTS billboard (
#     chartname TEXT NOT NULL,
#     date TEXT NOT NULL,
#     song TEXT NOT NULL,
#     artist TEXT NOT NULL,
#     rank INTEGER NULL
# )
# """)

# conn.execute("""
# CREATE TABLE IF NOT EXISTS wiki_artists (
#     name TEXT NOT NULL,
#     country TEXT NOT NULL,
#     lineage TEXT NOT NULL
# )
# """)

# conn.execute("""
# CREATE TABLE IF NOT EXISTS billboard_artists (
#     date TEXT NOT NULL,
#     artist TEXT NOT NULL,
#     rank INTEGER NULL
# )
# """)

# conn.commit()
# conn.close()


In [None]:
conn = sqlite3.connect(db)
cursor = conn.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()
tables

##  Wiki Artists to SQLite

In [None]:
# upload artists from Wikipedia
df_wik = pd.read_csv(r'./post_processed_data/df_wik.csv')
conn = sqlite3.connect(db)

# Select columns by name: the CSV also carries the saved index column.
# Values are bound as parameters so quotes cannot break the statement.
rows = [(str(artist).replace('"', ''),
         str(country).replace('"', ''),
         str(lineage))
        for artist, country, lineage in zip(
            df_wik['artist'], df_wik['country'], df_wik['lineage'])]
conn.executemany('INSERT INTO wiki_artists VALUES (?, ?, ?)', rows)

conn.commit()
conn.close()
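
Artist names can contain quote characters, which is a common hazard when SQL text is built by string concatenation. The following is a minimal sketch of parameter binding with the standard `sqlite3` module, run against an in-memory database rather than `lab2.db`. The table layout follows the `wiki_artists` definition above; the two sample names are made up for illustration.

```python
import sqlite3

# In-memory database so the sketch does not touch lab2.db.
conn = sqlite3.connect(':memory:')
conn.execute("""
CREATE TABLE wiki_artists (
    name TEXT NOT NULL,
    country TEXT NOT NULL,
    lineage TEXT NOT NULL
)
""")

# Hypothetical rows: one name with double quotes, one with an apostrophe.
rows = [
    ('Example "Nick" Artist', 'Thai', 'Asian'),
    ("O'Example", 'Thai American', 'Mixed Asian'),
]

# Each ? is filled by the driver, so no manual escaping is needed.
conn.executemany('INSERT INTO wiki_artists VALUES (?, ?, ?)', rows)
conn.commit()

stored = conn.execute(
    'SELECT name FROM wiki_artists ORDER BY rowid').fetchall()
conn.close()
print(stored)
```

Both names are stored unchanged, so stripping quotes becomes a data-cleaning choice rather than a requirement of the insert.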


In [None]:
# change case of name to lowercase
conn = sqlite3.connect(db)
conn.execute('update wiki_artists set name = lower(name)')

conn.execute('commit')
conn.close()


In [None]:
# remove rows whose name is a Wikipedia category link rather than an artist
conn = sqlite3.connect(db)
conn.execute("DELETE FROM wiki_artists WHERE lower(name) LIKE 'category%'")

conn.execute('commit')
conn.close()


## Billboard pickles to SQLite

In [None]:
# upload billboard weekly hot 100
pkl = pd.read_pickle(r'./pre_processed_data/weekly_hot_100.pkl')
print(len(pkl))
conn = sqlite3.connect(db)

# Each key is a chart date; its value lists (song, artist) in rank order.
for chart_date, entries in pkl.items():
    rows = [('weekly_hot_100', chart_date, entry[0],
             entry[1].replace('"', ''), rank)
            for rank, entry in enumerate(entries, start=1)]
    # Bound parameters keep quotes in song titles from breaking the insert.
    conn.executemany('INSERT INTO billboard VALUES (?, ?, ?, ?, ?)', rows)

conn.commit()
conn.close()


In [None]:
# upload billboard 200
pkl = pd.read_pickle(r'./pre_processed_data/weekly_billboard_200.pkl')
print(len(pkl))
conn = sqlite3.connect(db)

for i in pkl:
    y = 1
    for x in range(len(pkl[i])):

        strt1 = pkl[i][x][0]
        strt1 = strt1.replace('"', '')

        strt2 = pkl[i][x][1]
        strt2 = strt2.replace('"', '')
        strt3 = ('"weekly_billboard_200","' + i + '","' + strt1 +
                 '" , "' + strt2 + '", ' + str(y)) 

        comm = 'insert into billboard values (' + strt3 + ')' 
        conn.execute(comm)
#        print (comm)
        y += 1

conn.execute("commit")
conn.close()


In [None]:

# upload billboard weekly billboard global 200
pkl = pd.read_pickle(r'./pre_processed_data/weekly_billboard_global_200.pkl')
conn = sqlite3.connect(db)

# Each key is a chart date; its value lists (song, artist) in rank order.
for chart_date, entries in pkl.items():
    rows = [('weekly_billboard_global_200', chart_date, entry[0],
             entry[1].replace('"', ''), rank)
            for rank, entry in enumerate(entries, start=1)]
    # Bound parameters keep quotes in song titles from breaking the insert.
    conn.executemany('INSERT INTO billboard VALUES (?, ?, ?, ?, ?)', rows)

conn.commit()
conn.close()


In [None]:
# upload billboard weekly billboard global excluding US 200
pkl = pd.read_pickle(
        r'./pre_processed_data/weekly_billboard_global_exclUS_200.pkl')
conn = sqlite3.connect(db)

# Each key is a chart date; its value lists (song, artist) in rank order.
for chart_date, entries in pkl.items():
    rows = [('weekly_billboard_global_exclUS_200', chart_date, entry[0],
             entry[1].replace('"', ''), rank)
            for rank, entry in enumerate(entries, start=1)]
    # Bound parameters keep quotes in song titles from breaking the insert.
    conn.executemany('INSERT INTO billboard VALUES (?, ?, ?, ?, ?)', rows)

conn.commit()
conn.close()



In [None]:
# change case of artist to lowercase
conn = sqlite3.connect(db)
conn.execute('update billboard set artist = lower(artist)')

conn.execute('commit')
conn.close()


In [None]:
# remove characters after 'Feat.'
conn = sqlite3.connect(db)
conn.execute("""
update billboard 
set artist = (substr(artist, 1, instr(artist, 'feat.') - 2)) 
WHERE artist like '%feat.%'
""")
conn.execute("commit")
conn.close()


In [None]:
# remove characters after 'featuring'
conn = sqlite3.connect(db)
conn.execute("""
update billboard 
set artist = (substr(artist, 1, instr(artist, 'featuring') - 2)) 
WHERE artist like '%featuring%'
""")
conn.execute("commit")
conn.close()    
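
The two updates above rely on SQLite's 1-based `instr()`: subtracting 2 from the marker position drops the marker and the single space assumed to precede it. The sketch below replays the same expression on an in-memory table. The table is a one-column stand-in for `billboard`, and the artist strings are invented placeholders, not chart data.

```python
import sqlite3

# One-column stand-in for the billboard table, kept in memory.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE billboard (artist TEXT NOT NULL)')
conn.executemany(
    'INSERT INTO billboard VALUES (?)',
    [('artist a feat. artist b',),
     ('artist c featuring artist d',),
     ('artist e',)])

for marker in ('feat.', 'featuring'):
    # Keep everything before the marker, minus the space in front of it.
    conn.execute(
        """
        UPDATE billboard
        SET artist = substr(artist, 1, instr(artist, ?) - 2)
        WHERE artist LIKE '%' || ? || '%'
        """, (marker, marker))

cleaned = [row[0] for row in
           conn.execute('SELECT artist FROM billboard ORDER BY rowid')]
conn.close()
print(cleaned)
```

Rows without either marker are left untouched. A credit written without a leading space, such as `(feat. x)`, would keep one stray character, so a follow-up `trim()` may be worth considering.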


In [None]:
# limit the scope: drop chart entries dated after 2021-08-01
conn = sqlite3.connect(db)
conn.execute("DELETE FROM billboard WHERE date > '2021-08-01'")
conn.execute("commit")
conn.close()

## Billboard Artists to SQLite

In [None]:
# upload Billboard Artists
pkl = pd.read_pickle(r'./pre_processed_data/weekly_top_artist_100.pkl')
conn = sqlite3.connect(db)

# Each key is a chart date; its value lists artist names in rank order.
# The rank is stored too, matching the three-column table definition.
for chart_date, names in pkl.items():
    rows = [(chart_date, name.replace('"', ''), rank)
            for rank, name in enumerate(names, start=1)]
    conn.executemany(
        'INSERT INTO billboard_artists (date, artist, rank) VALUES (?, ?, ?)',
        rows)

conn.commit()
conn.close()

In [None]:
conn = sqlite3.connect(db)
conn.execute("""
update billboard_artists
set artist = lower(artist)
""")
conn.execute("commit")
conn.close()

## Spotify pickles to SQLite 

Viral50 Charts (No stream_count)

In [None]:
# WITHOUT STREAM

spotify_files = [
    './pre_processed_data/spotify_viral50_daily_global.pkl',
    './pre_processed_data/spotify_viral50_daily_us.pkl',
    './pre_processed_data/spotify_viral50_weekly_global.pkl',
    './pre_processed_data/spotify_viral50_weekly_us.pkl'
]

# Connect to db
conn = sqlite3.connect(db)

for fp in spotify_files:
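    # Parse the bare file name; splitting the full path on '.' would
    # return an empty string because the path starts with './'.
    f = fp.rsplit('/', 1)[-1].rsplit('.', 1)[0].split('_')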

    print(fp)

    # Open pickle file
    spotify_pickle = pd.read_pickle(fp)

    # Build format and insert to SQL
    for chart_date in spotify_pickle.keys():

        artists = []
        artists2 = []
        songs = []
        streams = []
        rank = []
        # Split artists and get feature artist
        for i, item in enumerate(spotify_pickle[chart_date]):
            if len(item[1]) < 1:
                continue
            # maxsplit=1 keeps artist names that contain 'by ' intact
            a_list = item[1].split('by ', 1)[1].split(', ')

            for artist in a_list:
                rank.append(i+1)
                artists.append(spotify_pickle[chart_date][i][1])
                artists2.append(str.lower(artist))
                songs.append(item[0])

        # Build DF
        df_s = pd.DataFrame({'artist': artists,
                             'artist2': artists2,
                             'rank': rank,
                             'song': songs
                             })
        df_s['date'] = chart_date
        df_s['stream_count'] = 0
        df_s['chartname'] = f[1]
        df_s['chartfreq'] = f[2]
        df_s['region'] = f[3]
        df_s = df_s[['chartname', 'chartfreq', 'date', 'region', 'rank',
                     'song', 'artist', 'artist2', 'stream_count']]
        df_s = df_s.set_index('chartname')

        # update batch to DB
        df_s.to_sql('spotify', con=conn, if_exists='append')

conn.close()
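
The artist handling in the loop above can be sketched in isolation. This assumes each scraped artist string starts with `by ` and separates names with `, `, as the split calls imply; `split_artists` and the sample chart are hypothetical names and placeholder values used only for illustration.

```python
def split_artists(raw):
    """Return lowercase artist names from a string such as 'by A, B'."""
    if not raw:
        return []
    # Strip only the leading 'by ' so names containing 'by ' stay whole.
    names = raw[3:] if raw.startswith('by ') else raw
    return [name.strip().lower() for name in names.split(', ')
            if name.strip()]


# Placeholder chart: (song, raw artist string) in rank order.
chart = [
    ('Song One', 'by Artist A, Artist B'),
    ('Song Two', ''),
    ('Song Three', 'by Baby Example'),
]

rows = []
for position, (song, raw) in enumerate(chart, start=1):
    # One output row per credited artist; the rank follows chart position,
    # so entries with an empty artist string leave a gap in the ranks.
    for name in split_artists(raw):
        rows.append((position, song, raw, name))

for row in rows:
    print(row)
```

Emitting one row per credited artist lets a collaboration count toward every artist on the track, at the cost of repeating the song and rank.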

Top200 Daily (with streams)

In [None]:
# WITH STREAMs second batch
# FILENAMES

spotify_files = [
    './pre_processed_data/spotify_top200_daily_global_v2.pkl',
    './pre_processed_data/spotify_top200_daily_us_v2.pkl'
]

# Connect to db
conn = sqlite3.connect(db)

for fp in spotify_files:
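    # Parse the bare file name; splitting the full path on '.' would
    # return an empty string because the path starts with './'.
    f = fp.rsplit('/', 1)[-1].rsplit('.', 1)[0].split('_')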

    print(fp)

    # Open pickle file
    spotify_pickle = pd.read_pickle(fp)

    # Build format and insert to SQL
    for chart_date in spotify_pickle.keys():

        artists = []
        artists2 = []
        songs = []
        streams = []
        rank = []
        # Split artists and get feature artist
        for i, item in enumerate(spotify_pickle[chart_date]):
            if len(item[1]) < 1:
                continue
            # maxsplit=1 keeps artist names that contain 'by ' intact
            a_list = item[1].split('by ', 1)[1].split(', ')

            for artist in a_list:
                rank.append(i+1)
                artists.append(spotify_pickle[chart_date][i][1])
                artists2.append(str.lower(artist))
                songs.append(item[0])
                streams.append(int(item[2].replace(',', '')))

        df_s = pd.DataFrame({'artist': artists,
                             'artist2': artists2,
                             'stream_count': streams,
                             'rank': rank,
                             'song': songs
                             })
        df_s['date'] = chart_date
        df_s['chartname'] = f[1]
        df_s['chartfreq'] = f[2]
        df_s['region'] = f[3]
        df_s = df_s[['chartname', 'chartfreq', 'date', 'region', 'rank',
                     'song', 'artist', 'artist2', 'stream_count']]
        df_s = df_s.set_index('chartname')

        # update batch to DB
        df_s.to_sql('spotify', con=conn, if_exists='append')

conn.close()

Top200 Weekly (with streams)

In [None]:
# WITH STREAM first batch

spotify_files = ['./pre_processed_data/spotify_top200_weekly_global_v2.pkl',
                 './pre_processed_data/spotify_top200_weekly_us_v2.pkl']

# Connect to db
conn = sqlite3.connect(db)

for fp in spotify_files:
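    # Parse the bare file name; splitting the full path on '.' would
    # return an empty string because the path starts with './'.
    f = fp.rsplit('/', 1)[-1].rsplit('.', 1)[0].split('_')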

    print(fp)

    # Open pickle file
    spotify_pickle = pd.read_pickle(fp)

    # Build format and insert to SQL
    for chart_date in spotify_pickle.keys():

        artists = []
        artists2 = []
        songs = []
        streams = []
        rank = []
        # Split artists and get feature artist
        for i, item in enumerate(spotify_pickle[chart_date][0]):
            if len(item[1]) < 1:
                continue
            # maxsplit=1 keeps artist names that contain 'by ' intact
            a_list = item[1].split('by ', 1)[1].split(', ')

            for artist in a_list:
                rank.append(i+1)
                artists.append(spotify_pickle[chart_date][0][i][1])
                artists2.append(str.lower(artist))
                songs.append(item[0])
                stream_info = spotify_pickle[chart_date][1][i]
                streams.append(int(stream_info.replace(',', '')))
        df_s = pd.DataFrame({'artist': artists,
                             'artist2': artists2,
                             'stream_count': streams,
                             'rank': rank,
                             'song': songs
                             })
        df_s['date'] = chart_date
        df_s['chartname'] = f[1]
        df_s['chartfreq'] = f[2]
        df_s['region'] = f[3]
        df_s = df_s[['chartname', 'chartfreq', 'date', 'region', 'rank',
                     'song', 'artist', 'artist2', 'stream_count']]
        df_s = df_s.set_index('chartname')

        # update batch to DB
        df_s.to_sql('spotify', con=conn, if_exists='append')

conn.close()

In [None]:
# limit the scope for all charts: drop entries dated after 2021-08-01
conn = sqlite3.connect(db)
conn.execute("DELETE FROM spotify WHERE date > '2021-08-01'")
conn.execute("commit")
conn.close()
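
The cutoff filter compares `date` as text, so it behaves like a date comparison only when the values are zero-padded `YYYY-MM-DD` strings. That format is an assumption here, since the scraped date format is not shown in this section. A minimal in-memory sketch with placeholder rows:

```python
import sqlite3

# Two-column stand-in for the spotify table, kept in memory.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE spotify (date TEXT NOT NULL, song TEXT NOT NULL)')
conn.executemany('INSERT INTO spotify VALUES (?, ?)', [
    ('2021-07-25', 'before cutoff'),
    ('2021-08-01', 'on cutoff'),
    ('2021-08-08', 'after cutoff'),
])

# Text comparison: ISO dates sort the same way as the dates they encode.
conn.execute("DELETE FROM spotify WHERE date > '2021-08-01'")
conn.commit()

remaining = [row[0] for row in
             conn.execute('SELECT date FROM spotify ORDER BY date')]
conn.close()
print(remaining)
```

Entries dated exactly on the cutoff are kept because the comparison is strict. Dates in another layout, such as `8/1/2021`, would need converting before this filter is reliable.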