# SNSD Song Analysis

## Objective
Analyse song data of the Korean girl group SNSD from 2007 to 2014 (the OT9 period, plus OT8 versions of songs from the same period), and find patterns in screentime and lyric distribution across songs and their music videos, where one exists.

## How to get it organized

### Screentime dataframe
The screentime dataframe will be gathered manually from video files. Every frame of each video will be extracted and counted for member presence, and the frame counts will then be converted to time using each video's FPS (in progress). The final data will contain summaries of solo, centre, side and insignificant screentime in the story and dance categories for each member.
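
The frame-to-time conversion described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the per-frame labels, the function names `frames_to_seconds` and `screentime_summary`, and the 24 FPS value are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical per-frame labels for one member in one video: each entry is
# the category assigned to that frame (values are illustrative only).
frame_labels = ["solo", "solo", "centre", "side", "solo", "insignificant"]


def frames_to_seconds(frame_count, fps):
    """Convert a number of frames to seconds at the given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame_count / fps


def screentime_summary(labels, fps):
    """Return seconds per category from a sequence of per-frame labels."""
    counts = Counter(labels)
    return {category: frames_to_seconds(n, fps) for category, n in counts.items()}


# Assumed frame rate; in practice this would be read from each video file.
summary = screentime_summary(frame_labels, fps=24.0)
```

Counting frames first and dividing by the FPS at the end keeps the counts exact and makes it easy to handle videos with different frame rates.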

### Lyrics dataframe
The lyrics dataframe will be built from a full lyrics index with color coding for each member. Pages will be downloaded programmatically from colorcodedlyrics.com and then cleaned so that each line is paired with the member who sings it. The final data will contain, for each member, summaries of total time, number of individual parts and number of words.
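
As a rough sketch of the per-member summaries, assume the cleaning step yields one `(member, line)` pair per sung line. The input list, the placeholder lyric text and the function name `summarise_parts` are hypothetical, and total time is left out because it needs timing data that the lyric pages do not obviously provide.

```python
from collections import defaultdict

# Hypothetical cleaned data: one (member, lyric line) pair per sung line.
# The lyric text is placeholder text, not taken from any song.
lines = [
    ("Taeyeon", "first example line"),
    ("Jessica", "second line"),
    ("Taeyeon", "another example line here"),
]


def summarise_parts(pairs):
    """Count parts and whitespace-separated words per member."""
    summary = defaultdict(lambda: {"parts": 0, "words": 0})
    for member, text in pairs:
        summary[member]["parts"] += 1
        summary[member]["words"] += len(text.split())
    return dict(summary)


lyrics_summary = summarise_parts(lines)
```

Splitting on whitespace is only an approximation of a word count, and consecutive lines by the same member are counted here as separate parts.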

## Building the lyrics dataframe

In [45]:
import requests
import os
from bs4 import BeautifulSoup

### Get links for all song lyrics

In [238]:
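# download the lyrics index page and save it locally

url = 'https://colorcodedlyrics.com/2012/02/snsd_lyrics_index'
response = requests.get(url)
response.raise_for_status()  # stop early if the index page could not be fetched

# name the file after the last URL segment
name = url.split('/')[-1] + '.txt'
with open(name, mode='wb') as file:
    file.write(response.content)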

In [239]:
# getting the links for all lyric pages

with open(name, encoding='utf8') as file:
    soup = BeautifulSoup(file, features='html5lib')

# the index is the table with CSS class 'indexlyrics'
lyrics_index = soup.find('table', class_='indexlyrics')

# collect the target of every anchor that has an href
links = [link['href'] for link in lyrics_index.find_all('a', href=True)]

### Cleaning the list

In [240]:
# remove all links after The Boys, but keep HaHaHa, Visual Dreams and Chocolate Love

to_keep = ['https://colorcodedlyrics.com/2013/02/girls-generation-sonyeosidae-hahaha-samsung-cf', 
           'https://colorcodedlyrics.com/2011/02/08/snsd-visual-dream-cc-lyrics', 
           'https://colorcodedlyrics.com/2010/04/28/snsd-chocolate-love-color-coded-lyrics/']

links.index('https://colorcodedlyrics.com/2013/04/girls-generation-feat-snoop-dogg-the-boys') #cutoff link

links = links[0:270]
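
The cell above defines `to_keep` but then truncates with a fixed slice. One way the cutoff and the keep-list could be combined is sketched below on a toy list. The helper name `truncate_with_exceptions` is hypothetical, and excluding the cutoff link itself is an assumption; whether this reproduces `links[0:270]` depends on the layout of the index, which is not shown here.

```python
def truncate_with_exceptions(links, cutoff_link, to_keep):
    """Keep every link before cutoff_link, plus later links listed in to_keep.

    Assumption: the cutoff link itself is dropped.
    """
    cutoff = links.index(cutoff_link)
    kept = links[:cutoff]
    # re-add explicitly whitelisted links that appear at or after the cutoff
    kept += [link for link in links[cutoff:] if link in to_keep and link not in kept]
    return kept


# Toy data standing in for the real list of URLs.
demo = ["a", "b", "cutoff", "c", "d"]
result = truncate_with_exceptions(demo, "cutoff", ["d"])
```

Looking up the cutoff by value rather than by a fixed number keeps the cell working if the index page gains or loses entries.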

### Remove all originally non-OT9 songs and duplicates

In [241]:
# originally non-OT9 songs: 7989, Talk to Me, Lost in Love, One Year Later, Sailing,
# CMIYC (Korean version), Girls (both versions); the last two are removed by URL in the next cell
# also drop invalid links (URL contains 'preview')

excluded = ("7989", "talk-to-me", "lost-in-love", "one-year-later", "sailing", "preview")
links = [x for x in links if not any(term in x for term in excluded)]

In [242]:
links.remove("https://colorcodedlyrics.com/2015/04/girls-generation-sonyeosidae-girls")
links.remove("https://colorcodedlyrics.com/2015/08/girls-generation-girls-japanese-ver")
links.remove("https://colorcodedlyrics.com/2015/04/girls-generation-sonyeosidae-catch-me-if-you-can")

In [243]:
# list of albums: party, lion heart, holiday night

print(links.index('https://colorcodedlyrics.com/2015/08/girls-generation-sonyeosidae-lion-heart'))
print(links.index('https://colorcodedlyrics.com/2017/08/girls-generation-sonyeosidae-light-sky'))

51
72


In [244]:
del links[51:73]

In [245]:
print(links.index('https://colorcodedlyrics.com/2015/07/girls-generation-sonyeosidae-party'))

108


In [246]:
del links[108:110]

In [247]:
print(len(links))

234


In [248]:
#remove duplicates

clean_links = []

for link in links:
    if link not in clean_links:
        clean_links.append(link)

In [249]:
print(len(clean_links))

165
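
The loop used for de-duplication keeps the first occurrence of each link and preserves order. An equivalent and shorter form, sketched here on toy data, relies on dictionaries preserving insertion order (guaranteed from Python 3.7); the function name `dedupe_preserving_order` is illustrative.

```python
def dedupe_preserving_order(items):
    """Drop repeated items, keeping the first occurrence of each in order."""
    return list(dict.fromkeys(items))


# Toy data standing in for the list of URLs.
sample = ["x", "y", "x", "z", "y"]
unique = dedupe_preserving_order(sample)
```

This avoids the repeated `in` check against a growing list, which matters little for a few hundred links but scales better for longer ones.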


### Download source code for all links

In [250]:
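# the target folder must exist before any file is written into it
os.makedirs('lyrics', exist_ok=True)

for link in clean_links:
    response = requests.get(link)

    # name each file after the last URL segment, ignoring a trailing slash
    name = link.rstrip('/').split('/')[-1] + '.txt'

    with open(os.path.join('lyrics', name), mode='wb') as file:
        file.write(response.content)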