# 1.B. Collection -- User Data -- Genius

In this notebook, we'll build on the data we have already collected from Genius.com by extracting information on users and their artist preferences. The goal here is to build a matrix of user preferences to be applied to our recommender system. Unfortunately, this data is a little trickier to grab for a couple of reasons:

 - We've been using lyricsgenius as a wrapper for the Genius API, and there is no functionality with this library to access user data.
 - To my knowledge, the **user API endpoint does not contain followed artists**. This means that we need to extract user information from **artist endpoints**.

To accomplish this, we'll go straight to the Genius API and search for artist profiles. The workflow is as follows:

 1. Grab our list of artists from our running list of artist information. *Note: as mentioned previously, many iterations of this list have been created across Wiki, Genius, and Spotify. This is why you see duplicate lists below*
 
 
 2. **To extract our artist information...**
  - Find the most frequently occurring artist in the response we get. The first result is not *always* the one we want, but we can rely on the mode here.
  - Extract the artist name, URL, and API path so we can access their endpoint.
  
  
 3. **To extract user information...**
  - Take the artistID and navigate to the artist profile endpoint.
  - Here, we can iterate through every page of users that follow that artist.
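
The mode-based selection in step 2 can be sketched as a small helper. This is a minimal illustration, not the exact code used below: the function name `most_common_artist` and the sample hits are made up, and the nested keys (`result`, `primary_artist`, `name`, `api_path`, `url`) are taken from how the search response is read later in this notebook. Unlike the cell below, which takes the mode of the name, path, and URL separately, this sketch counts by `api_path` so that all three fields come from the same artist.

```python
from collections import Counter


def most_common_artist(hits):
    """Return (name, api_path, url) of the primary artist that appears most often in hits."""
    artists = [hit["result"]["primary_artist"] for hit in hits]
    if not artists:
        return None

    # Count by api_path so the name, path, and URL all belong to one artist
    counts = Counter(artist["api_path"] for artist in artists)
    top_path, _ = counts.most_common(1)[0]

    # Pull the full record for the winning path
    winner = next(a for a in artists if a["api_path"] == top_path)
    return winner["name"], winner["api_path"], winner["url"]


# Made-up hits that mimic the shape of a search response (values are hypothetical)
sample_hits = [
    {"result": {"primary_artist": {"name": "Artist B", "api_path": "/artists/2", "url": "https://example.com/artist-b"}}},
    {"result": {"primary_artist": {"name": "Artist A", "api_path": "/artists/1", "url": "https://example.com/artist-a"}}},
    {"result": {"primary_artist": {"name": "Artist A", "api_path": "/artists/1", "url": "https://example.com/artist-a"}}},
]

best = most_common_artist(sample_hits)
print(best)
```

Here the first hit belongs to a different artist, but the helper still picks the artist that dominates the results.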
  
A number of different approaches were tested here, including using **Selenium** to access this information. Results with that approach were inconsistent, but the notebook is available (labeled **"ARCHIVED"**) if there's interest in looking.


In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import time
import requests
import seaborn as sns
import warnings
import matplotlib.pyplot as plt
import lyricsgenius
import selenium
import sys
import spotipy
import spotipy.util as util
import statistics as stats


In [2]:
#store our genius token
genius_token = "QH034S7zNrqva_ceDCQYvMx4K1MaSwOgABqIfQt8VjUov5mh75oTn89PzV21GwMk"

In [100]:
#Grab our list of rapper names for searching
rapper_list_for_genius_df1 = pd.read_csv('combined_rapper_data_1.csv', header=None)
rapper_list_for_genius_df2 = pd.read_csv('combined_rapper_data_2.csv', header=None)
rapper_list_for_genius_df3 = pd.read_csv('combined_rapper_data_3.csv', header=None)

In [101]:
rapper_list_for_genius_df = pd.concat([rapper_list_for_genius_df1,rapper_list_for_genius_df2,rapper_list_for_genius_df3])

### Genius - Artist DF Creation

In [1]:
#Instantiate empty list for storing our artist data. Specifically, we
#will be looking for rapper, url, api_path, and we'll also make note of anyone we missed.
genius_rapper_list = []
genius_url_list = []
genius_api_path_list = []
missed_rappers=[]

rapper_list_for_genius = list(rapper_list_for_genius_df[1].dropna())

#Iterate through our list of rappers
for rapper in rapper_list_for_genius:
    
    try:

        #build our request to get artist data
        base_url = 'https://api.genius.com'
        headers = {'Authorization': 'Bearer ' + genius_token}
        search_url = base_url + '/search'
        data = {'q': rapper}
        
        #store the response that we get
        response = requests.get(search_url, params=data, headers=headers)
        json = response.json()['response']['hits']

        #if there was anything returned from our search
        if( len(json) > 0):
            #Find the most frequent rapper name in the results. The first result is not reliable
            most_freq_name = stats.mode([json[i]['result']['primary_artist']['name'] for i in range(len(json))])
            genius_rapper_list.append(most_freq_name)
            
            
            #Find the most frequent path in the results. The first result is not reliable
            most_freq_path = stats.mode([json[i]['result']['primary_artist']['api_path'] for i in range(len(json))])
            genius_api_path_list.append(most_freq_path)

            #Find the most frequent url in the results. The first result is not reliable
            most_freq_url = stats.mode([json[i]['result']['primary_artist']['url'] for i in range(len(json))])
            genius_url_list.append(most_freq_url)
        
            print(F"Artist Check: searched for {rapper} and found {most_freq_name}.")
        else:
            
            #if we didn't get anything, append blanks
            print("we didn't get any results. Skipping artist.")
            genius_rapper_list.append('')
            genius_api_path_list.append('')
            genius_url_list.append('')
            missed_rappers.append(rapper)
            
    except Exception:
        print(f'Skipping Artist ({rapper}) because request failed')
        genius_rapper_list.append('')
        genius_api_path_list.append('')
        genius_url_list.append('')
        missed_rappers.append(rapper)

In [114]:
#build our dataframe with the lists that we created
artist_results_df = pd.DataFrame({
    'artist_name' : genius_rapper_list,
    'artist_api_path' : genius_api_path_list,
    'artist_url' : genius_url_list
})

In [122]:
#save it
artist_results_df.to_csv('genius_artist_table.csv', index=False)

### Genius - Follower DF Creation

In [2]:
#Same basic process here-- we'll make some lists to store our user data within
genius_artist_api_path = []
genius_follower_names = []
genius_follower_id = []
genius_follower_role = []
genius_follower_api_path =[]

#iterate through our list of api_paths
for api_path in list(artist_results_df.artist_api_path.dropna()):
    
    #We start out on page 1, loop 1. We'll keep going until there's no more pages of users to grab.
    page = 1
    loop_check = 1 
    
    #While there are still users
    while(loop_check):
        print(api_path)
            
        try:
            #build our request
            base_url = 'https://genius.com/api' + api_path + '/followers?page=' + str(page)      
            response = requests.get(base_url)
            json = response.json()
            
            #If we aren't seeing many followers, we can end our loop
            if(len(json['response']['followers']) <2):
                loop_check=0
            
            #Otherwise, we need to grab follower name, id, role, and api path
            for i in range(len(json['response']['followers'])):
                genius_artist_api_path.append(api_path)
                genius_follower_names.append(json['response']['followers'][i]['name'])
                genius_follower_id.append(json['response']['followers'][i]['id'])
                genius_follower_role.append(json['response']['followers'][i]['role_for_display'])
                genius_follower_api_path.append(json['response']['followers'][i]['api_path'])     
            
            #increment our page value
            page+=1
                
        except Exception:
            #We need to insert blanks
            genius_artist_api_path.append('')
            genius_follower_names.append('')
            genius_follower_id.append('')
            genius_follower_role.append('')
            genius_follower_api_path.append('')
            loop_check=0
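
The paging logic above can also be expressed as a generator, which keeps the stopping condition in one place and makes it easy to test without hitting the network. This is a hedged sketch under a few assumptions: `iter_followers`, `fetch_page`, `fake_fetch`, and the `max_pages` cap are hypothetical names introduced here, and the sketch stops on an empty page rather than on a page with fewer than two followers as the cell above does.

```python
def iter_followers(api_path, fetch_page, max_pages=1000):
    """Yield follower dicts for one artist, page by page, until a page comes back empty.

    fetch_page(api_path, page) must return a dict shaped like the follower
    response read above: {'response': {'followers': [...]}}.
    max_pages is a safety cap so a misbehaving endpoint cannot loop forever.
    """
    for page in range(1, max_pages + 1):
        payload = fetch_page(api_path, page)
        followers = payload["response"]["followers"]
        if not followers:
            break
        yield from followers


def fake_fetch(api_path, page):
    """Stand-in for the real HTTP call; returns two pages of made-up users."""
    pages = {
        1: [{"id": 1, "name": "user_a"}, {"id": 2, "name": "user_b"}],
        2: [{"id": 3, "name": "user_c"}],
    }
    return {"response": {"followers": pages.get(page, [])}}


rows = [(f["id"], f["name"]) for f in iter_followers("/artists/1", fake_fetch)]
print(rows)
```

In real use, `fetch_page` would wrap the same `requests.get(...)` call and `.json()` parse used in the cell above.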

        

In [165]:
#Build our dataframe
follower_df_test = pd.DataFrame({
    'artist_api_path' : genius_artist_api_path,
    'follower_name' : genius_follower_names,
    'follower_id' : genius_follower_id,
    'follower_role' : genius_follower_role,
    'follower_api_path' : genius_follower_api_path
})

In [177]:
#If the user is present, it's a follow. We'll use this as a proxy for ratings (unfortunately don't have this information)
follower_df_test['follow'] = 1
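
Since the end goal is a matrix of user preferences, here is one way the `follow` flag could be turned into a user-by-artist table. It is only a sketch on a made-up toy frame (`toy` and `user_artist` are hypothetical names), and it assumes that a missing user/artist pair should be treated as 0. For the full follower table a sparse representation would likely be needed instead of a dense pivot.

```python
import pandas as pd

# Toy stand-in for the follower table, using the same column names
toy = pd.DataFrame({
    "follower_id": [10, 10, 11, 12],
    "artist_api_path": ["/artists/1", "/artists/2", "/artists/1", "/artists/2"],
    "follow": [1, 1, 1, 1],
})

# One row per user, one column per artist; 1 = follows, 0 = no follow recorded.
# aggfunc="max" keeps the value at 1 even if a pair was scraped more than once.
user_artist = toy.pivot_table(
    index="follower_id",
    columns="artist_api_path",
    values="follow",
    aggfunc="max",
    fill_value=0,
)

print(user_artist)
```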

In [179]:
#Save to CSV
follower_df_test.to_csv('follower_genius_500k.csv', index=False)

In [183]:
#insert artist names
follower_df_test_w_names = pd.merge(follower_df_test,artist_results_df, how='left', on='artist_api_path')

In [184]:
follower_df_test_w_names.columns

Index(['artist_api_path', 'follower_name', 'follower_id', 'follower_role',
       'follower_api_path', 'follow', 'artist_name', 'artist_url'],
      dtype='object')
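
One thing worth checking with a left merge like this: if the artist table contains the same `artist_api_path` more than once (for example, two searches that resolved to the same artist, or blank rows from missed artists), every matching follower row is duplicated. The sketch below shows one possible guard on made-up data. The frames `followers`, `artists`, and `artists_unique` are hypothetical, and dropping blank and duplicate paths is an assumption about what is wanted, not something done in this notebook.

```python
import pandas as pd

# Toy frames that mimic the two tables being merged
followers = pd.DataFrame({
    "artist_api_path": ["/artists/1", "/artists/1", "/artists/2"],
    "follower_id": [10, 11, 12],
})
artists = pd.DataFrame({
    "artist_api_path": ["/artists/1", "/artists/1", "/artists/2", ""],
    "artist_name": ["Artist A", "Artist A", "Artist B", ""],
})

# Drop blank paths and repeated artists so each path appears once
artists_unique = (
    artists[artists["artist_api_path"] != ""]
    .drop_duplicates(subset="artist_api_path")
)

# validate="many_to_one" raises a MergeError if the right-hand keys are not unique
merged = pd.merge(
    followers,
    artists_unique,
    how="left",
    on="artist_api_path",
    validate="many_to_one",
)

print(len(followers), len(merged))
```

If the two lengths match, the merge added artist names without multiplying follower rows.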

In [220]:
#Save this version to csv as well (note: same filename as above, so this overwrites the earlier file)
follower_df_test_w_names.to_csv('follower_genius_500k.csv', index=False)