# Collection -- Reviews -- NeedleDrop

In this notebook, we're going to tackle collecting reviews from TheNeedleDrop, a prominent critic on Youtube. In order to do this, we are going to have to access the closed caption text that accompanies his videos. We'll also take our own scraped data set and combine it with an older, but extensive, dataset we retrieved from Kaggle. The reason we are going to combine is two-fold: closed captions are not always available and can be turned off at any point by a content creator; second, our Kaggle dataset is only through part of 2018. 

The workflow is as follows:
 1. We'll leverage a library called **YoutubeDataAPI** which is a simple python wrapper around YT API calls. 
 2. From there, we will extract **video_ids** from relevant NeedleDrop playlists
 3. With our video_ids, we will pass these into **YoutubeTranscriptAPI**, a disparate wrapper for grabbing CC
 4. With our own collected data in place, we will clean in and merge with the pre-existing dataset
 

In [1]:
import pandas as pd
from youtube_api import YouTubeDataAPI
from youtube_transcript_api import YouTubeTranscriptApi

## My NeedleDrop Pull

In [2]:
#We're going to need to leverage our Youtube API key to grab ND YT content. We'll do this with a pre-existing python wrapper
api_key = 'AIzaSyCEk20yodR-1N8LL05hD1eJymyf1C1owKQ'

#Instantiate a YT object with our API key
yt = YouTubeDataAPI(api_key)

In [3]:
#Get playlists for our channel
needle_drop_playlists = yt.get_playlists(yt.get_channel_id_from_user("theneedledrop"))

In [4]:
#There are two types of playlists, each may have relevant content. One is the classics
needle_drop_classics_id = needle_drop_playlists[11]['playlist_id']

#...the other is hip hop
needle_drop_hiphop_id = needle_drop_playlists[12]['playlist_id']

In [8]:
#The API call seems to miss the first page of results when a "next page token is passed". 
#So we'll start by pulling and storing all our next pages
#We'll then go back and grab our original page results

check2class = pd.DataFrame(yt.get_videos_from_playlist_id(needle_drop_hiphop_id, next_page_token='CDIQAA'))
check1class = pd.DataFrame(yt.get_videos_from_playlist_id(needle_drop_hiphop_id))

check2 = pd.DataFrame(yt.get_videos_from_playlist_id(needle_drop_classics_id, next_page_token='CDIQAA'))
check1 = pd.DataFrame(yt.get_videos_from_playlist_id(needle_drop_classics_id))

In [10]:
#Merge and drop any duplicates we may have seen
needle_classic  = pd.concat([check1class, check2class]).drop_duplicates()
needle_hiphop  = pd.concat([check1, check2]).drop_duplicates()

In [16]:
#Now we'll grab our unique list of video ids and move into the next phase
needle_hip_hop= pd.concat([needle_hiphop,needle_classic]).drop_duplicates('video_id')

## Grab CC

In [2]:
#We'll take a moment to set up our lists. These will then be read into a dataframe at the end
video_id_list = []
review_text_full=[]
video_titles=[]
video_description=[]
video_tags=[]

#Iterate through each video we have in our list
for video_id in list(needle_hip_hop['video_id']):
    
    
    try:
        #try grabbing the video title, description, and tags      
        response = YouTubeTranscriptApi.get_transcript(video_id)        
        video_titles.append(yt.get_video_metadata(video_id)['video_title'])
        video_description.append(yt.get_video_metadata(video_id)['video_description'])
        video_tags.append(yt.get_video_metadata(video_id)['video_tags'])
        
        #prep a list to store review caption text
        review_text = []
        
        #iterate through our response and join our text together
        for i in range(len(response)):
            review_text.append(response[i]['text'])

        review_text_join = " ".join(review_text)
        review_text_full.append(review_text_join)
        video_id_list.append(video_id)
        
    except:
        #If we couldn't grab anything because captions are disabled, just move on and leave these blank
        review_text_full.append('')
        video_id_list.append(video_id)
        video_titles.append('')
        video_description.append('')
        video_tags.append('')
        

In [34]:
#Read our results into a dataframe
temp_test_needle_df =pd.DataFrame({
        'review_text' : review_text_full,
        'video_id' : video_id_list,
        'video_titles' : video_titles,
        'video_descrip' : video_description,
        'video_tags' : video_tags,  
})

In [39]:
#save the results to csv
temp_test_needle_df.to_csv('./needledrop/needle_drop_scraped_needtocheck.csv', index=False)

In [79]:
#What are we looking at here? YIKES, that's not the cleanest string I've ever seen
pd.set_option('display.max_colwidth',1000)
temp_test_needle_df[145:146][['video_id','video_titles','video_descrip']]

Unnamed: 0,video_id,video_titles,video_descrip
157,qbh0s7OCZOE,A$AP Mob - Cozy Tapes Vol. 2: Too Cozy ALBUM REVIEW,"Listen: https://www.youtube.com/watch?v=LtlO4FVzXew&ab_channel=asapmobVEVO\n\nIt's telling that one of the few highlights on Cozy Tapes Vol. 2 is a track with little to no A$AP Mob presence.\n\nMore hip hop reviews: https://www.youtube.com/playlist?list=PLP4CSgl7K7ormBIO138tYonB949PHnNcP\n\nBuy this album: http://amzn.to/2wlwFHU\n\n===================================\nSubscribe: http://bit.ly/1pBqGCN\n\nOfficial site: http://theneedledrop.com\n\nTND Twitter: http://twitter.com/theneedledrop\n\nTND Facebook: http://facebook.com/theneedledrop\n\nSupport TND: http://theneedledrop.com/support\n===================================\n\nFAV TRACKS: GET THE BAG, BAHAMAS, WHAT HAPPENS, RAF\n\nLEAST FAV TRACK: PLEASE SHUT UP\n\nA$AP MOB - COZY TAPES VOL. 2: TOO COZY / 2017 / RCA / EAST COAST HIP HOP, TRAP RAP, CLOUD RAP\n\n4/10\n\nY'all know this is just my opinion, right?"


## Kaggle NeedleDrop Pull

Read In

In [99]:
#We're going to combine our dataset with a separate Needle Drop review csv we found via Kaggle
kaggle_needle = pd.read_csv('./needledrop/fantano_reviews.csv', encoding='latin-1')
kaggle_needle_captions = pd.read_csv('./needledrop/captions.csv', encoding='latin-1')

In [101]:
#There was an errant column in each of these csvs, drop them
kaggle_needle_captions = kaggle_needle_captions.drop(columns=['Unnamed: 0'])
kaggle_needle = kaggle_needle.drop(columns=['Unnamed: 0'])

In [103]:
#Merge on unique link and check the shape
kaggle_needle = pd.merge(kaggle_needle,kaggle_needle_captions,on='link', how='left')
kaggle_needle.shape

(1734, 10)

Clean

In [104]:
#clean our video_ids so we can compare
kaggle_needle['video_id'] = kaggle_needle['link'].str.replace('https://www.youtube.com/watch?','')
kaggle_needle['video_id'] = kaggle_needle['video_id'].str.replace('?','')
kaggle_needle['video_id'] = kaggle_needle['video_id'].str.replace('v=','')

In [9]:
kaggle_needle.columns

Index(['title', 'artist', 'review_date', 'review_type', 'score', 'word_score',
       'best_tracks', 'worst_track', 'link', 'caption', 'video_id'],
      dtype='object')

Compare / Find Overlap

In [128]:
#Let's compare our data set to this one, do we have more reviews? 
overlap_df = pd.merge(temp_test_needle_df,kaggle_needle, on='video_id', how='left')

In [131]:
overlap_df.loc[overlap_df['review_date'].isnull()].count()[0]

130

..There are 130 reviews that we picked up that are not in this data set. We will take the time to clean our reviews and append dfs

In [135]:
#grab the additional video ids
video_id_additions = list(overlap_df.loc[overlap_df['review_date'].isnull(),'video_id'])
needle_additions_df  = temp_test_needle_df[temp_test_needle_df['video_id'].isin(video_id_additions)]

In [141]:
#Let's take a look and make sure they look like hip hop albums
pd.set_option('display.max_colwidth',100)
needle_additions_df.head()

Unnamed: 0,video_id,review_text,video_titles,video_descrip,video_tags
0,gmKru-Is0SA,[Music] hi everyone gee thinnies uh Stan oh here the Internet's busiest music nerd and it's time...,Kanye West - Jesus Is King ALBUM REVIEW,Listen: https://www.youtube.com/watch?v=AOBQkHy8_p8\n\nKanye's unstoppable egoism and lack of fo...,album|review|music|reviews|indie|underground|new|latest|lyrics|full song|listen|track|concert|li...
1,jtPlfmhPnNg,ah hi everyone ghoul that he's spooked a now here the Internet's creepiest music nerd and it's t...,clipping. - There Existed an Addiction to Blood ALBUM REVIEW,Listen: https://www.youtube.com/watch?v=fIrpLBShe1A\n\nMake room in your annual Halloween-time m...,album|review|music|reviews|indie|underground|new|latest|lyrics|full song|listen|track|concert|li...
2,V2ZTJ8oD4TQ,hi everyone vibe the new check tan oh here the Internet's busiest music nerd and it's time for a...,Danny Brown - uknowhatimsayin¿ ALBUM REVIEW,"Listen: https://www.youtube.com/watch?v=zcloEzJU27E\n\nFor the most part, uknowhatimsayin¿ is an...",album|review|music|reviews|indie|underground|new|latest|lyrics|full song|listen|track|concert|li...
5,zcn-Kp_OWfI,hi everyone fly the kite Tanana here the Internet's busiest music nerd and it is time for a revi...,Ameer Vann - Emmanuel EP REVIEW,Listen: https://www.youtube.com/watch?v=Sd6yXx5ytyI\n\nAmeer Vann's comeback EP is pretty bitter...,album|review|music|reviews|indie|underground|new|latest|lyrics|full song|listen|track|concert|li...
6,ZGmQ9_ceFoY,Oh hi everyone Anthony Fantana here intranets busiest music nerd and it's time for a review of t...,JPEGMAFIA - All My Heroes Are Cornballs ALBUM REVIEW,Listen: https://www.youtube.com/watch?v=d6U8waR9smg\n\nI’m not disappointed. That’s for sure. \n...,album|review|music|reviews|indie|underground|new|latest|lyrics|full song|listen|track|concert|li...


In [143]:
#Save what we have so far, then read back in
needle_additions_df.to_csv('./needledrop/additions_unclean_for_kaggle_data.csv', index=False)
needle_additions_df= pd.read_csv('./needledrop/additions_unclean_for_kaggle_data.csv')

### Clean

In [18]:
pd.set_option('display.max_rows',100)
needle_additions_df.head(100)

Unnamed: 0,video_id,review_text,video_titles,video_descrip,video_tags
0,gmKru-Is0SA,[Music] hi everyone gee thinnies uh Stan oh he...,KANYE WEST - JESUS IS KING,Listen: https://www.youtube.com/watch?v=AOBQkH...,album|review|music|reviews|indie|underground|n...
1,jtPlfmhPnNg,ah hi everyone ghoul that he's spooked a now h...,CLIPPING. - THERE EXISTED AN ADDICTION TO BLOOD,Listen: https://www.youtube.com/watch?v=fIrpLB...,album|review|music|reviews|indie|underground|n...
2,V2ZTJ8oD4TQ,hi everyone vibe the new check tan oh here the...,DANNY BROWN - UKNOWHATIMSAYIN¿,Listen: https://www.youtube.com/watch?v=zcloEz...,album|review|music|reviews|indie|underground|n...
3,zcn-Kp_OWfI,hi everyone fly the kite Tanana here the Inter...,AMEER VANN - EMMANUEL,Listen: https://www.youtube.com/watch?v=Sd6yXx...,album|review|music|reviews|indie|underground|n...
4,ZGmQ9_ceFoY,Oh hi everyone Anthony Fantana here intranets ...,JPEGMAFIA - ALL MY HEROES ARE CORNBALLS,Listen: https://www.youtube.com/watch?v=d6U8wa...,album|review|music|reviews|indie|underground|n...
5,D95K29ZTxpo,Oh Oh a b c d e f g tan oh here the internet's...,IDK - IS HE REAL?,Listen: https://www.youtube.com/watch?v=bSRkdq...,album|review|music|reviews|indie|underground|n...
6,-YlaJuowR1w,hi everyone nice a nice tandem here the Intern...,EARTHGANG - MIRRORLAND,Listen: https://www.youtube.com/watch?v=nAt2op...,album|review|music|reviews|indie|underground|n...
7,Za6s9zOntCE,hi everyone then the tan oh here the Internet'...,POST MALONE - HOLLYWOOD'S BLEEDING,Listen: https://www.youtube.com/watch?v=eXLPdJ...,album|review|music|reviews|indie|underground|n...
8,Yw86LVY5Ayo,hi everyone why the white Ana here the Interne...,DRAKE - CARE PACKAGE COMPILATION REVIEW,Listen: https://www.youtube.com/watch?v=pviZE1...,album|review|music|reviews|indie|underground|n...
9,n6wpOWd92kc,ooh hi everyone ear they popped Anna here the ...,RICH BRIAN - THE SAILOR,Listen: https://www.youtube.com/watch?v=yeW1cC...,album|review|music|reviews|indie|underground|n...


In [17]:
#From what we can see there are a number of issudes with the video titles. 
#These are for the most part Artist and Album names, and it will be important later that we have clean versions
needle_additions_df['video_titles'] = needle_additions_df['video_titles'].str.upper().replace(' EP REVIEW','')
needle_additions_df['video_titles'] = needle_additions_df['video_titles'].str.upper().replace(' ALBUM REVIEW','')
needle_additions_df['video_titles'] = needle_additions_df['video_titles'].str.upper().replace(' MIXTAPE REVIEW','')
needle_additions_df['video_titles'] = needle_additions_df['video_titles'].str.upper().replace(' COMPILATION','')

In [72]:
#Separate the artist and albums
pd.DataFrame(needle_additions_df['video_titles'].str.split('-').values.tolist(), columns=['artist', 'album'])

In [49]:
#grab clean artists and albums, then rewrite them back into our dataframe
clean_artists = list(pd.DataFrame(needle_additions_df['video_titles'].str.split('-').values.tolist())[0])
clean_albums = list(pd.DataFrame(needle_additions_df['video_titles'].str.split('-').values.tolist())[1])
needle_additions_df['artist_name'] = clean_artists
needle_additions_df['album_name'] = clean_albums

### Combine

In [92]:
#rename columns for consistency
needle_additions_df = needle_additions_df.rename(columns={'artist_name':'artist', 'album_name':'title', 'review_text': 'caption'})

In [93]:
needle_additions_df.columns

Index(['video_id', 'caption', 'video_titles', 'video_descrip', 'video_tags',
       'artist', 'title'],
      dtype='object')

In [81]:
kaggle_needle.columns

Index(['title', 'artist', 'review_date', 'review_type', 'score', 'word_score',
       'best_tracks', 'worst_track', 'link', 'caption', 'video_id'],
      dtype='object')

In [79]:
kaggle_needle.shape

(1734, 11)

In [106]:
#merge our dataframes together and save to csv
kaggle_needle = kaggle_needle.append(needle_additions_df[['artist', 'title', 'caption']],ignore_index=True)
kaggle_needle.to_csv('./needledrop/combined_needle_review_data.csv', index=False)