# **Genius Scraper**


### Using Genius to scrape each song's title, artist and lyrics

---

### **Set-up ⚙️**

Import necessary packages

*⚠️ Note: Do not run this more than once. Restart the kernel before running this code chunk.*

In [1]:
import json
import pandas as pd
import requests                                     # used below to create the session
from tqdm import tqdm
import os
os.chdir("..")                                      # change directory to main project directory (relative, so run only once)

from dees_package.genius_functions import *         # imports custom functions for genius scraping

Check that we are in the correct current working directory

*⚠️ Note: We should be in the main project directory*

In [2]:
print("Current working directory:", os.getcwd())

Current working directory: /Users/hanbinfeng/Desktop/LSE_Data_Science/ds105a-project-dees-nuts


Open JSON file containing credentials

*⚠️ Note: Credentials should be saved in a file named `credentials.json` in the root of the project folder*

In [3]:
credentials_file_path = './credentials.json'

with open(credentials_file_path, 'r') as f:
    credentials = json.load(f)
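
A missing file or a missing key only surfaces later, partway through the scrape, so it can help to check the credentials up front. The following is a minimal sketch, not part of `dees_package`: `load_credentials` is a hypothetical helper, and the only key it assumes is the `client_access_token` key used later in this notebook.

```python
import json
from pathlib import Path


def load_credentials(path, required_keys=("client_access_token",)):
    """Load a JSON credentials file and check that the expected keys are present.

    Hypothetical helper for illustration; not part of dees_package.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Credentials file not found: {path.resolve()}")

    with path.open("r", encoding="utf-8") as f:
        credentials = json.load(f)

    # Fail early with a readable message instead of a KeyError mid-scrape
    missing = [key for key in required_keys if not credentials.get(key)]
    if missing:
        raise KeyError(f"Missing or empty keys in {path.name}: {missing}")
    return credentials
```

With this in place, the cell above could be reduced to `credentials = load_credentials('./credentials.json')`.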

Initialise a new session

In [4]:
my_session = requests.Session()

---

### **Data Scraping 🔍**

Import cleaned YouTube data into a dataframe

In [5]:
cleaned_df = pd.read_csv('./data/cleaned_youtube_final_data.csv', index_col=0)

Using the `search_genius` function, search Genius with each video title as the query, and store the song title, artist and Genius URL it returns in a new dataframe

*⚠️ Note: This code chunk will take a long time to run and complete*
<br>*⚠️ Note: Genius API access token should be saved under the key `client_access_token` in the `credentials.json` file*

In [6]:
scraped_df = pd.DataFrame([search_genius(q, credentials['client_access_token']) for q in tqdm(cleaned_df['video_title'])])

100%|██████████| 685/685 [07:52<00:00,  1.45it/s]
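
The source of `search_genius` is not shown in this notebook, so the following is only a sketch of what such a function might do, assuming it calls the Genius API search endpoint with a bearer token and keeps the first hit. The names `parse_top_hit` and `search_genius_sketch` are hypothetical, and so are the `title` and `artist` column names; only the `URL` column is confirmed by the later cells.

```python
GENIUS_SEARCH_URL = "https://api.genius.com/search"


def parse_top_hit(payload):
    """Pull title, artist and URL out of the first hit of a Genius search response."""
    hits = payload.get("response", {}).get("hits", [])
    if not hits:
        # Keep the same keys so every query yields one row with the same columns
        return {"title": None, "artist": None, "URL": None}
    result = hits[0]["result"]
    return {
        "title": result.get("title"),
        "artist": result.get("primary_artist", {}).get("name"),
        "URL": result.get("url"),
    }


def search_genius_sketch(session, query, access_token):
    """Query the Genius search endpoint and return the top hit as a flat dict.

    `session` is anything with a requests-style `.get` method,
    for example a `requests.Session()`. Hypothetical stand-in for
    the project's own `search_genius`.
    """
    response = session.get(
        GENIUS_SEARCH_URL,
        params={"q": query},
        headers={"Authorization": f"Bearer {access_token}"},
        timeout=10,
    )
    response.raise_for_status()
    return parse_top_hit(response.json())
```

Returning a dict with fixed keys, even when there is no hit, keeps one row per query, so the resulting dataframe stays in the same order and length as `cleaned_df['video_title']`.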


Using the `scrape_lyrics` function, scrape the lyrics of each song and add them to the existing dataframe as a `lyrics` column; rows without a Genius URL get an empty string

*⚠️ Note: This code chunk will take a long time to run and complete*

In [7]:
scraped_df['lyrics'] = scraped_df.apply(lambda row: scrape_lyrics(my_session, row['URL']) if row['URL'] else '', axis=1)
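
Because this step makes one request per row, a single failed request can abort the whole run. One possible safeguard is a small retry wrapper around the scraping function. This is a sketch under assumptions: `scrape_with_retries` is a hypothetical helper, and it assumes `scrape_lyrics` signals failure by raising an exception, which the notebook does not confirm.

```python
import time


def scrape_with_retries(scrape_fn, session, url, retries=3, delay_seconds=1.0):
    """Call `scrape_fn(session, url)`, retrying on errors; return '' if every attempt fails."""
    if not url:
        # Mirror the notebook's behaviour: no URL means no lyrics
        return ""
    for attempt in range(1, retries + 1):
        try:
            return scrape_fn(session, url)
        except Exception as exc:  # broad on purpose: the helper's failure modes are unknown here
            print(f"Attempt {attempt}/{retries} failed for {url}: {exc}")
            if attempt < retries:
                time.sleep(delay_seconds * attempt)  # wait a little longer each time
    return ""
```

It could be used as `scraped_df['lyrics'] = [scrape_with_retries(scrape_lyrics, my_session, url) for url in tqdm(scraped_df['URL'])]`, which also gives a progress bar for this step.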

Clean the dataframe by removing rows with empty lyrics (this step is currently disabled: both lines in the cell below are commented out)

In [8]:
# scraped_df.dropna(subset=['lyrics'], inplace=True)
# df = scraped_df[scraped_df.astype(bool).any(axis=1)]
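
Neither commented-out line would remove the rows in question: `dropna` only drops `NaN`/`None`, while the previous step stores an empty string for songs without a URL, and `astype(bool).any(axis=1)` only drops rows in which every column is empty. A minimal sketch of one alternative follows; `drop_empty_lyrics` is a hypothetical helper and the toy dataframe is made up for illustration.

```python
import pandas as pd


def drop_empty_lyrics(df, column="lyrics"):
    """Return a copy of `df` without rows whose lyrics are missing or blank."""
    # Treat None/NaN, '' and whitespace-only strings all as "empty"
    lyrics = df[column].fillna("").astype(str).str.strip()
    return df[lyrics != ""].copy()


# Toy data standing in for scraped_df
example = pd.DataFrame({
    "title": ["Song A", "Song B", "Song C"],
    "lyrics": ["la la la", "", None],
})
print(drop_empty_lyrics(example))  # only the "Song A" row is kept
```

Note that dropping rows here changes the length of the dataframe, so any later step that matches rows to the YouTube data by position would need to run before this filter.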

Save the dataframe to a CSV file

In [9]:
scraped_df.to_csv('./data/raw_data_with_lyrics.csv')

### **Data Cleaning**

In [10]:
youtube_df = pd.read_csv('./data/cleaned_youtube_final_data.csv')
# Both dataframes must contain a 'video_id' column; pandas raises KeyError if either one lacks it
raw_compiled_data = pd.merge(youtube_df, scraped_df, on='video_id', sort=False)

KeyError: 'video_id'
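
The `KeyError` means that `video_id` is not a column in at least one of the two dataframes. `scraped_df` was built only from Genius search results, so it most likely has no `video_id` column. Since it was created row by row from `cleaned_df['video_title']`, one possible fix is to copy the IDs across by position before merging. This is a sketch under assumptions: `attach_video_ids` is a hypothetical helper, `video_id` is assumed to be a column of the YouTube dataframe, and no rows may have been dropped or reordered in between.

```python
import pandas as pd


def attach_video_ids(youtube_df, scraped_df, id_column="video_id"):
    """Copy the ID column onto the scraped rows by position, then merge on it."""
    if len(youtube_df) != len(scraped_df):
        raise ValueError("Frames differ in length, so rows cannot be matched by position")
    scraped_with_ids = scraped_df.reset_index(drop=True).copy()
    # to_numpy() ignores the index, so the IDs are assigned purely by row order
    scraped_with_ids[id_column] = youtube_df[id_column].to_numpy()
    # validate= makes pandas raise if an ID appears more than once on either side
    return pd.merge(youtube_df, scraped_with_ids, on=id_column,
                    sort=False, validate="one_to_one")


# Toy frames standing in for the real data
youtube_example = pd.DataFrame({"video_id": ["a1", "b2"],
                                "video_title": ["Song A", "Song B"]})
scraped_example = pd.DataFrame({"title": ["Song A", "Song B"],
                                "URL": ["https://genius.com/a", None]})
merged = attach_video_ids(youtube_example, scraped_example)
print(merged.columns.tolist())
```

If the video IDs are not unique, `validate="one_to_one"` will raise a `MergeError`; in that case joining the two frames side by side with `pd.concat(..., axis=1)` may be the simpler option.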

In [None]:
raw_compiled_data.to_csv('./data/raw_compiled_data.csv')