# ETL Project by Johneson Giang

## SCOPE:
### - Extracted, transformed, and loaded YouTube's Top Trending Videos from December 2017 through May 2018, keeping only videos categorized as music, and created an "Artist" column to enable joining with Spotify's Top 100 Songs of 2018. Both dataframes were loaded into MySQL.


## PURPOSE:
### - I chose this project because I'm an avid music listener and frequent concertgoer, and I wanted to work with data that I was familiar with.

### Data Sources - Kaggle
 - https://www.kaggle.com/datasnaek/youtube-new/downloads/youtube-new.zip/114
 - https://www.kaggle.com/nadintamer/top-spotify-tracks-of-2018
 



## Step 1) Import Dependencies

In [1]:
#!pip install PyMySQL



In [2]:
# Import Dependencies:
import os
import csv
import json
import numpy as np
import pandas as pd
from datetime import datetime
import simplejson
import sys
import string
from sqlalchemy import create_engine
import pymysql
pymysql.install_as_MySQLdb()

## Step 2) "Extract" the data

In [3]:
# YouTube Data - Raw CSV

csv_file_yt = "youtube_USvideos.csv"
yt_rawdata_df = pd.read_csv(csv_file_yt, encoding='utf-8')
#yt_rawdata_df

In [4]:
# Load the JSON Category file and print to see the categories

#YouTube Data - Raw JSON - Categories
yt_json_file = "youtube_US_category_id.json"
yt_rawjson_df = pd.read_json(yt_json_file)
list(yt_rawjson_df)
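for i in yt_rawjson_df.iterrows():
    # Note: `.items` here is the Series method, not the "items" column, so this prints the
    # bound method; use i[1]["items"] to print each category entry instead.
    print(i[1].items)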

<bound method Series.iteritems of kind                     youtube#videoCategoryListResponse
etag     "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv...
items    {'kind': 'youtube#videoCategory', 'etag': '"m2...
Name: 0, dtype: object>
<bound method Series.iteritems of kind                     youtube#videoCategoryListResponse
etag     "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv...
items    {'kind': 'youtube#videoCategory', 'etag': '"m2...
Name: 1, dtype: object>
<bound method Series.iteritems of kind                     youtube#videoCategoryListResponse
etag     "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv...
items    {'kind': 'youtube#videoCategory', 'etag': '"m2...
Name: 2, dtype: object>
<bound method Series.iteritems of kind                     youtube#videoCategoryListResponse
etag     "m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv...
items    {'kind': 'youtube#videoCategory', 'etag': '"m2...
Name: 3, dtype: object>
<bound method Series.iteritems of kind                     y
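
The category file can also be read directly to build an ID-to-title lookup instead of transcribing the categories by hand. The sketch below is a minimal example, not part of the original notebook. It assumes the file follows the YouTube `videoCategoryListResponse` layout, where each entry in `items` carries an `id` and a `snippet` with a `title`. The inline sample stands in for `youtube_US_category_id.json`, and `build_category_lookup` is a hypothetical helper name.

```python
import json

# Inline sample standing in for youtube_US_category_id.json (assumed layout:
# each entry in "items" has an "id" and a "snippet" containing a "title").
sample_json = """
{
  "kind": "youtube#videoCategoryListResponse",
  "items": [
    {"kind": "youtube#videoCategory", "id": "1",  "snippet": {"title": "Film & Animation"}},
    {"kind": "youtube#videoCategory", "id": "10", "snippet": {"title": "Music"}},
    {"kind": "youtube#videoCategory", "id": "24", "snippet": {"title": "Entertainment"}}
  ]
}
"""

def build_category_lookup(raw):
    """Return a dict mapping category ID strings to category titles."""
    return {item["id"]: item["snippet"]["title"] for item in raw["items"]}

category_lookup = build_category_lookup(json.loads(sample_json))
print(category_lookup["10"])
```

With the real file, the same helper could be fed the result of `json.load` on the opened file, and the lookup could then be applied to the category column with an exact-match mapping.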

In [5]:
# Spotify 2018 - Top 100 Songs - Raw CSV

csv_file_spotify2018 = "spotify_top2018.csv"
spotify2018_rawdata_df = pd.read_csv(csv_file_spotify2018, encoding='utf-8')
spotify2018_rawdata_df.head()

Unnamed: 0,id,name,artists,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,6DCZcSspjsKoFjzjrWoCd,God's Plan,Drake,0.754,0.449,7.0,-9.211,1.0,0.109,0.0332,8.3e-05,0.552,0.357,77.169,198973.0,4.0
1,3ee8Jmje8o58CHK66QrVC,SAD!,XXXTENTACION,0.74,0.613,8.0,-4.88,1.0,0.145,0.258,0.00372,0.123,0.473,75.023,166606.0,4.0
2,0e7ipj03S05BNilyu5bRz,rockstar (feat. 21 Savage),Post Malone,0.587,0.535,5.0,-6.09,0.0,0.0898,0.117,6.6e-05,0.131,0.14,159.847,218147.0,4.0
3,3swc6WTsr7rl9DqQKQA55,Psycho (feat. Ty Dolla $ign),Post Malone,0.739,0.559,8.0,-8.011,1.0,0.117,0.58,0.0,0.112,0.439,140.124,221440.0,4.0
4,2G7V7zsVDxg1yRsu7Ew9R,In My Feelings,Drake,0.835,0.626,1.0,-5.833,1.0,0.125,0.0589,6e-05,0.396,0.35,91.03,217925.0,4.0


## Step 3) "Transform" the data (clean, manipulate, and etc.)

In [6]:
# Figure out if there's any missing information
yt_rawdata_df.count()

video_id                  40949
trending_date             40949
title                     40949
channel_title             40949
category_id               40949
publish_time              40949
tags                      40949
views                     40949
likes                     40949
dislikes                  40949
comment_count             40949
thumbnail_link            40949
comments_disabled         40949
ratings_disabled          40949
video_error_or_removed    40949
description               40379
dtype: int64

In [42]:
# Explore the data and types
yt_rawdata_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40949 entries, 0 to 40948
Data columns (total 16 columns):
video_id                  40949 non-null object
trending_date             40949 non-null object
title                     40949 non-null object
channel_title             40949 non-null object
category_id               40949 non-null int64
publish_time              40949 non-null object
tags                      40949 non-null object
views                     40949 non-null int64
likes                     40949 non-null int64
dislikes                  40949 non-null int64
comment_count             40949 non-null int64
thumbnail_link            40949 non-null object
comments_disabled         40949 non-null bool
ratings_disabled          40949 non-null bool
video_error_or_removed    40949 non-null bool
description               40379 non-null object
dtypes: bool(3), int64(5), object(8)
memory usage: 4.2+ MB


In [8]:
# Clean the column names - note to self use underscores next time instead of spaces
yt_cleandata_df = yt_rawdata_df.rename(columns={"video_id":"Video ID", "trending_date":"Trending Date",
                                                "title":"Title", "channel_title":"Channel Title",
                                                "category_id":"Category Titles", "publish_time":"Publish Time",
                                                "tags":"Tags", "views":"Views",
                                                "likes":"Likes", "dislikes":"Dislikes", 
                                                "comment_count":"Comment Count", "thumbnail_link":"Thumbnail Link",
                                                "comments_disabled":"Comments Disabled", "ratings_disabled":"Ratings Disabled",
                                                "video_error_or_removed":"Video Error Or Removed", "description":"Description"
                                               })
yt_cleandata_df

Unnamed: 0,Video ID,Trending Date,Title,Channel Title,Category Titles,Publish Time,Tags,Views,Likes,Dislikes,Comment Count,Thumbnail Link,Comments Disabled,Ratings Disabled,Video Error Or Removed,Description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...
5,gHZ1Qz0KiKM,17.14.11,2 Weeks with iPhone X,iJustine,28,2017-11-13T19:07:23.000Z,"ijustine|""week with iPhone X""|""iphone x""|""appl...",119180,9763,511,1434,https://i.ytimg.com/vi/gHZ1Qz0KiKM/default.jpg,False,False,False,Using the iPhone for the past two weeks -- her...
6,39idVpFF7NQ,17.14.11,Roy Moore & Jeff Sessions Cold Open - SNL,Saturday Night Live,24,2017-11-12T05:37:17.000Z,"SNL|""Saturday Night Live""|""SNL Season 43""|""Epi...",2103417,15993,2445,1970,https://i.ytimg.com/vi/39idVpFF7NQ/default.jpg,False,False,False,Embattled Alabama Senate candidate Roy Moore (...
7,nc99ccSXST0,17.14.11,5 Ice Cream Gadgets put to the Test,CrazyRussianHacker,28,2017-11-12T21:50:37.000Z,"5 Ice Cream Gadgets|""Ice Cream""|""Cream Sandwic...",817732,23663,778,3432,https://i.ytimg.com/vi/nc99ccSXST0/default.jpg,False,False,False,Ice Cream Pint Combination Lock - http://amzn....
8,jr9QtXwC9vc,17.14.11,The Greatest Showman | Official Trailer 2 [HD]...,20th Century Fox,1,2017-11-13T14:00:23.000Z,"Trailer|""Hugh Jackman""|""Michelle Williams""|""Za...",826059,3543,119,340,https://i.ytimg.com/vi/jr9QtXwC9vc/default.jpg,False,False,False,"Inspired by the imagination of P.T. Barnum, Th..."
9,TUmyygCMMGA,17.14.11,Why the rise of the robots won’t mean the end ...,Vox,25,2017-11-13T13:45:16.000Z,"vox.com|""vox""|""explain""|""shift change""|""future...",256426,12654,1363,2368,https://i.ytimg.com/vi/TUmyygCMMGA/default.jpg,False,False,False,"For now, at least, we have better things to wo..."


In [9]:
# Drop rows with missing information
yt_cleandata_df = yt_cleandata_df.dropna(how="any")
#yt_cleandata_df.info()

In [10]:
# Drop duplicates and sort by Trending Date by chaining.
# Note: the result is only displayed here, not assigned back, so yt_cleandata_df itself is unchanged.
yt_cleandata_df.drop_duplicates(['Video ID', 'Trending Date', 'Title', 'Channel Title', 'Category Titles', 'Publish Time']).sort_values(by=['Trending Date'], ascending=False)

Unnamed: 0,Video ID,Trending Date,Title,Channel Title,Category Titles,Publish Time,Tags,Views,Likes,Dislikes,Comment Count,Thumbnail Link,Comments Disabled,Ratings Disabled,Video Error Or Removed,Description
38051,EqeIRzY7hIU,18.31.05,6 Cheese Gadgets put to the Test!,CrazyRussianHacker,28,2018-05-20T18:58:15.000Z,"Cheese Gadgets|""Gadgets""|""Cheese""|""kitchen gad...",1519038,26770,1557,2875,https://i.ytimg.com/vi/EqeIRzY7hIU/default.jpg,False,False,False,$1000 Survival Kit in a Case - https://youtu.b...
38103,MAjY8mCTXWk,18.31.05,"周杰倫 Jay Chou【不愛我就拉倒 If You Don't Love Me, It's...",杰威爾音樂 JVR Music,10,2018-05-14T15:59:47.000Z,"周杰倫|""Jay""|""Chou""|""周董""|""周杰伦""|""周傑倫""|""杰威尔""|""周周""|""...",17259071,132009,9552,14789,https://i.ytimg.com/vi/MAjY8mCTXWk/default.jpg,False,False,False,詞：周杰倫、宋健彰（彈頭） 曲：周杰倫MV導演：周杰倫憂鬱型男的走心旋律 用英式搖滾宣洩...
38081,LzuDyq0-1LM,18.31.05,Why it's not a British royal wedding without f...,Vox,25,2018-05-18T11:00:03.000Z,"pop culture|""royal wedding""|""fancy hats""|""prin...",484199,8246,662,761,https://i.ytimg.com/vi/LzuDyq0-1LM/default.jpg,False,False,False,Fantastical fascinators at royal weddings are ...
38082,6SuMbFuKDf8,18.31.05,Backstreet Boys - Don't Go Breaking My Heart (...,BackstreetBoysVEVO,10,2018-05-17T04:00:01.000Z,"Backstreet Boys|""Don't Go Breaking My Heart""|""...",14717193,396826,16015,39035,https://i.ytimg.com/vi/6SuMbFuKDf8/default.jpg,False,False,False,Get the Backstreet Boys new single “Don’t Go B...
38083,yDiXQl7grPQ,18.31.05,Do You Hear Yanny or Laurel? (SOLVED with SCIE...,AsapSCIENCE,28,2018-05-16T18:16:26.000Z,"AsapSCIENCE|""audio illusion""|""yanny""|""laurel""|...",42667467,564046,33508,180469,https://i.ytimg.com/vi/yDiXQl7grPQ/default.jpg,False,False,False,Yanny vs. Laurel audio illusion solved! PHEW F...
38084,KObL442PWhQ,18.31.05,Fish | Basics with Babish,Binging with Babish,24,2018-05-17T16:47:42.000Z,"basics with babish|""binging with babish""|""cook...",1206106,34709,588,2526,https://i.ytimg.com/vi/KObL442PWhQ/default.jpg,False,False,False,"On this episode of Basics, we're taking a look..."
38085,z6A2LHGx8_A,18.31.05,Sigrid - High Five (Official Video),SigridVEVO,10,2018-05-17T12:00:04.000Z,"Sigrid|""High""|""Five""|""Universal-Island""|""Recor...",2212447,57495,1159,3008,https://i.ytimg.com/vi/z6A2LHGx8_A/default.jpg,False,False,False,Listen to High Five here: https://Sigrid.lnk.t...
38086,H7gh2fmdjCU,18.31.05,Calum Scott - What I Miss Most (Official Video),CalumScottVEVO,10,2018-05-16T23:00:00.000Z,"Calum|""Scott""|""What""|""Miss""|""Most""|""Capitol""|""...",6379536,133606,1488,4460,https://i.ytimg.com/vi/H7gh2fmdjCU/default.jpg,False,False,False,Calum’s debut album ‘Only Human’ feat. “You Ar...
38087,TjXQzRWmb_I,18.31.05,Destination Wedding Trailer #1 (2018) | Moviec...,Movieclips Trailers,1,2018-05-16T20:29:17.000Z,"Destination Wedding|""Destination Wedding Trail...",4579845,54636,1704,6309,https://i.ytimg.com/vi/TjXQzRWmb_I/default.jpg,False,False,False,Check out the official Destination Wedding tra...
38088,SbjnIK6VEjc,18.31.05,Yanny or Laurel: Which do you hear?,CBS This Morning,25,2018-05-16T15:48:39.000Z,"CBS News|""CBS This Morning""|""Yanny VS Lauren""|...",3225560,11933,2016,8029,https://i.ytimg.com/vi/SbjnIK6VEjc/default.jpg,False,False,False,A talking robot has sparked a noisy online deb...


In [11]:
# Drop unwanted columns

to_drop =['Publish Time', 'Tags', 'Thumbnail Link', 'Comments Disabled', 'Ratings Disabled', 'Video Error Or Removed', 'Description']

yt_cleandata_df.drop(to_drop, inplace=True, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)
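
The warning above typically appears when pandas cannot tell whether a frame is an independent copy or a view of another frame. One common way to avoid it, sketched here on a small stand-in frame rather than the real data, is to take an explicit `.copy()` after filtering and to assign the result of `drop` instead of using `inplace=True`. The frame and its values are illustrative only.

```python
import pandas as pd

# Small stand-in for the YouTube frame (values are illustrative only).
raw_df = pd.DataFrame({
    "Video ID": ["a1", "b2", "c3"],
    "Tags": ["x", "y", "z"],
    "Description": ["first", None, "third"],
})

# An explicit copy makes the cleaned frame independent of raw_df.
clean_df = raw_df.dropna(how="any").copy()

# Assign the result of drop rather than mutating in place.
clean_df = clean_df.drop(columns=["Tags", "Description"])

print(list(clean_df.columns))
print(len(clean_df))
```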


In [12]:
# Replace the "." in "Trending Date" to "-"

yt_cleandata_df['Trending Date'] = [x.replace(".","-") for x in yt_cleandata_df['Trending Date']]
yt_cleandata_df.head()



Unnamed: 0,Video ID,Trending Date,Title,Channel Title,Category Titles,Views,Likes,Dislikes,Comment Count
0,2kyS6SvSYSE,17-14-11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,748374,57527,2966,15954
1,1ZAPwfrtAFY,17-14-11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2418783,97185,6146,12703
2,5qpjK5DgCt4,17-14-11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,3191434,146033,5339,8181
3,puqaWrEC7tY,17-14-11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,343168,10172,666,2146
4,d380meD0W0M,17-14-11,I Dare You: GOING BALD!?,nigahiga,24,2095731,132235,1989,17518
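
The raw `trending_date` values are in year.day.month order (`17.14.11` is 14 November 2017), so sorting them as strings orders rows by day before month. A minimal sketch of parsing them into real dates with the standard library follows. It assumes the `%y.%d.%m` layout holds for every row, and the sample values are illustrative rather than taken from the data set.

```python
from datetime import datetime

def parse_trending_date(value):
    """Parse a 'YY.DD.MM' string (assumed layout) into a date."""
    return datetime.strptime(value, "%y.%d.%m").date()

# Illustrative values: 2 May 2018, 31 January 2018, 14 November 2017.
samples = ["18.02.05", "18.31.01", "17.14.11"]

# String sorting places 31 January after 2 May because the day comes first.
print(sorted(samples))

# Sorting on the parsed dates gives chronological order.
print(sorted(samples, key=parse_trending_date))
```

In pandas, `pd.to_datetime` with a matching `format` string performs the same conversion on a whole column. Once the dots have been replaced with dashes, the format would be `%y-%d-%m`.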


In [13]:
# Convert categories to string type to set up conversion of category titles
yt_cleandata_df['Category Titles'] = yt_cleandata_df['Category Titles'].apply(str)
yt_cleandata_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40379 entries, 0 to 40948
Data columns (total 9 columns):
Video ID           40379 non-null object
Trending Date      40379 non-null object
Title              40379 non-null object
Channel Title      40379 non-null object
Category Titles    40379 non-null object
Views              40379 non-null int64
Likes              40379 non-null int64
Dislikes           40379 non-null int64
Comment Count      40379 non-null int64
dtypes: int64(4), object(5)
memory usage: 3.1+ MB




In [14]:
# Learn about the category titles to convert them
yt_cleandata_df['Category Titles'].value_counts()

24    9819
10    6437
26    4140
23    3435
22    3061
25    2409
28    2361
1     2340
17    2125
27    1642
15     916
20     803
19     402
2      379
43      57
29      53
Name: Category Titles, dtype: int64

In [15]:
# Convert category IDs to category titles. The mapping was transcribed by hand from the
# YouTube categories JSON file opened in VS Code. An exact-match lookup is used because
# chained str.replace calls match substrings: replacing "1" would also rewrite "17", "15", and "19".

category_titles = {
    "24": "Entertainment", "10": "Music", "26": "How To & Style", "23": "Comedy",
    "22": "People & Blogs", "25": "News & Politics", "28": "Science & Technology",
    "1": "Film & Animation", "17": "Sports", "27": "Education", "15": "Pets & Animals",
    "20": "Gaming", "19": "Travel & Events", "2": "Autos & Vehicles", "43": "Shows",
    "29": "Nonprofits & Activism",
}
yt_cleandata_df['Category Titles'] = yt_cleandata_df['Category Titles'].replace(category_titles)
yt_cleandata_df


Unnamed: 0,Video ID,Trending Date,Title,Channel Title,Category Titles,Views,Likes,Dislikes,Comment Count
0,2kyS6SvSYSE,17-14-11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,People & Blogs,748374,57527,2966,15954
1,1ZAPwfrtAFY,17-14-11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,Entertainment,2418783,97185,6146,12703
2,5qpjK5DgCt4,17-14-11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,Comedy,3191434,146033,5339,8181
3,puqaWrEC7tY,17-14-11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,Entertainment,343168,10172,666,2146
4,d380meD0W0M,17-14-11,I Dare You: GOING BALD!?,nigahiga,Entertainment,2095731,132235,1989,17518
5,gHZ1Qz0KiKM,17-14-11,2 Weeks with iPhone X,iJustine,Science & Technology,119180,9763,511,1434
6,39idVpFF7NQ,17-14-11,Roy Moore & Jeff Sessions Cold Open - SNL,Saturday Night Live,Entertainment,2103417,15993,2445,1970
7,nc99ccSXST0,17-14-11,5 Ice Cream Gadgets put to the Test,CrazyRussianHacker,Science & Technology,817732,23663,778,3432
8,jr9QtXwC9vc,17-14-11,The Greatest Showman | Official Trailer 2 [HD]...,20th Century Fox,Film & Animation,826059,3543,119,340
9,TUmyygCMMGA,17-14-11,Why the rise of the robots won’t mean the end ...,Vox,News & Politics,256426,12654,1363,2368


In [16]:
# Sort the values by Trending Date
yt_cleandata_df = yt_cleandata_df.sort_values(by= ["Trending Date"], ascending=False)
yt_cleandata_df.head()

Unnamed: 0,Video ID,Trending Date,Title,Channel Title,Category Titles,Views,Likes,Dislikes,Comment Count
38051,EqeIRzY7hIU,18-31-05,6 Cheese Gadgets put to the Test!,CrazyRussianHacker,Science & Technology,1519038,26770,1557,2875
38103,MAjY8mCTXWk,18-31-05,"周杰倫 Jay Chou【不愛我就拉倒 If You Don't Love Me, It's...",杰威爾音樂 JVR Music,Music,17259071,132009,9552,14789
38081,LzuDyq0-1LM,18-31-05,Why it's not a British royal wedding without f...,Vox,News & Politics,484199,8246,662,761
38082,6SuMbFuKDf8,18-31-05,Backstreet Boys - Don't Go Breaking My Heart (...,BackstreetBoysVEVO,Music,14717193,396826,16015,39035
38083,yDiXQl7grPQ,18-31-05,Do You Hear Yanny or Laurel? (SOLVED with SCIE...,AsapSCIENCE,Science & Technology,42667467,564046,33508,180469


In [17]:
# Keep only the rows in the Music category
yt_musicdata_df = yt_cleandata_df.loc[yt_cleandata_df['Category Titles'] == 'Music']
yt_musicdata_df.head()

Unnamed: 0,Video ID,Trending Date,Title,Channel Title,Category Titles,Views,Likes,Dislikes,Comment Count
38103,MAjY8mCTXWk,18-31-05,"周杰倫 Jay Chou【不愛我就拉倒 If You Don't Love Me, It's...",杰威爾音樂 JVR Music,Music,17259071,132009,9552,14789
38082,6SuMbFuKDf8,18-31-05,Backstreet Boys - Don't Go Breaking My Heart (...,BackstreetBoysVEVO,Music,14717193,396826,16015,39035
38085,z6A2LHGx8_A,18-31-05,Sigrid - High Five (Official Video),SigridVEVO,Music,2212447,57495,1159,3008
38086,H7gh2fmdjCU,18-31-05,Calum Scott - What I Miss Most (Official Video),CalumScottVEVO,Music,6379536,133606,1488,4460
38096,MBR2kxt7RK8,18-31-05,Taylor Swift - Delicate (Vertical Version),TaylorSwiftVEVO,Music,2188738,114913,4587,7233


In [18]:
# Print out All Channel Titles to see what needs to be cleaned
for x in yt_musicdata_df['Channel Title'].unique():
     print(x)

杰威爾音樂 JVR Music
BackstreetBoysVEVO
SigridVEVO
CalumScottVEVO
TaylorSwiftVEVO
NickiMinajAtVEVO
Daddy Yankee
SZAVEVO
ibighit
Dan And Shay
FifthHarmonyVEVO
davematthewsbandVEVO
EnriqueIglesiasVEVO
Gallant
ChildishGambinoVEVO
Cardi B
SamSmithWorldVEVO
MeghanTrainorVEVO
Kelly Clarkson
Charlie Puth
johnmayerVEVO
SiaVEVO
BANGTANTV
weezer
SumerianRecords
Diplo
Rudy Mancuso
MustardVEVO
ShawnMendesVEVO
NickyJamTV
JenniferLopezVEVO
John Mayer
AzealiaBanksVEVO
H O N N E
ArianaGrandeVevo
BBCRadio1VEVO
Panic! At The Disco
Lauv
CAguileraVEVO
ZaynVEVO
Maroon5VEVO
Big Marvel
Rob Scallon
jypentertainment
5SOSVEVO
fantano
Mike Shinoda
LadyGagaVEVO
Maggie Lindemann
HalseyVEVO
ToniBraxtonVEVO
Atlantic Records
JasonAldeanVEVO
Cimorelli
ANDREW HUANG
CJENMMUSIC Official
Shawn Mendes
Alan Walker
PTXofficial
KeithUrbanVEVO
Shoshana Bean
KaceyMusgravesVEVO
David Guetta
ChrisStapletonVEVO
The Weeknd
ThirtySecondsToMarsVEVO
Rudimental
David Archuleta
ThreeDaysGraceVEVO
Spinach Dippa
dodieVEVO
iKON
JackWhiteVEVO
Ja

In [19]:
# Insert a new column for Artist. This will be used to join with the Spotify 2018 Top 100 artists.
# DataFrame.insert modifies the frame in place and returns None, so its result is not assigned.
yt_musicdata_df.insert(3, "Artist", "")

In [20]:
#yt_musicdata_df

In [21]:
# Clean Channel Title Column to set up the MAIN LOOP

yt_musicdata_df['Channel Title'] = [x.replace("VEVO","") for x in yt_musicdata_df['Channel Title']]
yt_musicdata_df['Channel Title'] = [x.replace("vevo","") for x in yt_musicdata_df['Channel Title']]
yt_musicdata_df['Channel Title'] = [x.replace("Vevo","") for x in yt_musicdata_df['Channel Title']]
yt_musicdata_df['Channel Title'] = [x.replace("Official","") for x in yt_musicdata_df['Channel Title']]
yt_musicdata_df['Channel Title'] = [x.replace("official","") for x in yt_musicdata_df['Channel Title']]
yt_musicdata_df['Channel Title'] = [x.replace("OFFICIAL","") for x in yt_musicdata_df['Channel Title']]
yt_musicdata_df['Channel Title'] = [x.replace("You Tube Channel","") for x in yt_musicdata_df['Channel Title']]
yt_musicdata_df['Channel Title'] = [x.replace("Music","") for x in yt_musicdata_df['Channel Title']]
yt_musicdata_df['Channel Title'] = [x.replace("music","") for x in yt_musicdata_df['Channel Title']]
yt_musicdata_df['Channel Title'] = [x.replace(" - Topic","") for x in yt_musicdata_df['Channel Title']]

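# Note: Series.replace returns a new Series, and the results below are not assigned back,
# so 'Channel Title' keeps the compact names. The MAIN LOOP ignores case and spaces, so matching still works.
yt_musicdata_df['Channel Title'].replace("BackstreetBoys","Backstreet Boys")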
yt_musicdata_df['Channel Title'].replace("CalumScott","Calum Scott")
yt_musicdata_df['Channel Title'].replace("TaylorSwift","Taylor Swift")
yt_musicdata_df['Channel Title'].replace("NickiMinajAt","Nicki Minaj")
yt_musicdata_df['Channel Title'].replace("FifthHarmony","Fifth Harmony")
yt_musicdata_df['Channel Title'].replace("davematthewsband","Dave Matthews Band")
yt_musicdata_df['Channel Title'].replace("EnriqueIglesias","Enrique Iglesias")
yt_musicdata_df['Channel Title'].replace("ChildishGambino","Childish Gambino")
yt_musicdata_df['Channel Title'].replace("SamSmithWorld","Sam Smith")
yt_musicdata_df['Channel Title'].replace("MeghanTrainor","Meghan Trainor")
yt_musicdata_df['Channel Title'].replace("johnmayer","John Mayer")
yt_musicdata_df['Channel Title'].replace("weezer","Weezer")
yt_musicdata_df['Channel Title'].replace("AzealiaBanks","Azealia Banks")
yt_musicdata_df['Channel Title'].replace("Maroon5","Maroon 5")
yt_musicdata_df['Channel Title'].replace("Zayn","ZAYN")
yt_musicdata_df['Channel Title'].replace("ArianaGrande","Ariana Grande")
yt_musicdata_df['Channel Title'].replace("CAguilera","Christina Aguilera")
yt_musicdata_df['Channel Title'].replace("LadyGaga","Lady Gaga")
yt_musicdata_df['Channel Title'].replace("ToniBraxton","Toni Braxton")
yt_musicdata_df['Channel Title'].replace("JasonAldean","Jason Aldean")
yt_musicdata_df['Channel Title'].replace("PTXofficial","PTX")
yt_musicdata_df['Channel Title'].replace("KeithUrban","Keith Urban")
yt_musicdata_df['Channel Title'].replace("KaceyMusgraves","Kacey Musgraves")
yt_musicdata_df['Channel Title'].replace("ChrisStapleton","Chris Stapleton")
yt_musicdata_df['Channel Title'].replace("ThirtySecondsToMars","Thirty Seconds To Mars")



38103          杰威爾音樂 JVR 
38082      BackstreetBoys
38085              Sigrid
38086          CalumScott
38096         TaylorSwift
38056        NickiMinajAt
38069        Daddy Yankee
38070                 SZA
38072             ibighit
38074        Dan And Shay
38075        FifthHarmony
38133    davematthewsband
38137     EnriqueIglesias
38145             Gallant
38146     ChildishGambino
37951             Cardi B
38130       SamSmithWorld
38129       MeghanTrainor
38116      Kelly Clarkson
38121        Charlie Puth
38125           johnmayer
38126                 Sia
37987           BANGTANTV
37962              weezer
37977     SumerianRecords
38037               Diplo
38042        Rudy Mancuso
38044             Mustard
38047         ShawnMendes
38007          NickyJamTV
               ...       
3621             Snapchat
3625        Martin Garrix
3620        Shoshana Bean
3619                 5FDP
3617            TheJuicyJ
3464               HANSON
3462            KaneBrown
3494        
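
The suffix removals above can also be expressed as one case-insensitive pattern. The following is a minimal sketch using only the standard library, with `clean_channel_title` as a hypothetical helper name. It is slightly broader than the original replacements: it matches any casing (so `MUSIC` is removed as well) and it trims leftover spaces.

```python
import re

# Labels to strip, matched case-insensitively (taken from the replacements above).
NOISE = re.compile(r"vevo|official|you tube channel|music| - topic", re.IGNORECASE)

def clean_channel_title(title):
    """Remove label noise from a channel title and trim surrounding spaces."""
    return NOISE.sub("", title).strip()

for raw in ["TaylorSwiftVEVO", "ArianaGrandeVevo", "PTXofficial", "杰威爾音樂 JVR Music"]:
    print(raw, "->", clean_channel_title(raw))
```

Removing substrings this way can still clip names that legitimately contain one of the labels, so the cleaned titles are worth a visual check before matching.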

In [22]:
# Test Loop to convert spotify artist names to lower case

for x in spotify2018_rawdata_df['artists'].unique():
    normalized = x.lower().replace(" ", "")  # named to avoid shadowing the built-in str
    print(x, " | ", normalized)

Drake  |  drake
XXXTENTACION  |  xxxtentacion
Post Malone  |  postmalone
Cardi B  |  cardib
Calvin Harris  |  calvinharris
Dua Lipa  |  dualipa
Marshmello  |  marshmello
Camila Cabello  |  camilacabello
Juice WRLD  |  juicewrld
Maroon 5  |  maroon5
Zedd  |  zedd
Kendrick Lamar  |  kendricklamar
Ariana Grande  |  arianagrande
Nicky Jam  |  nickyjam
BlocBoy JB  |  blocboyjb
Rudimental  |  rudimental
Nio Garcia  |  niogarcia
Bazzi  |  bazzi
5 Seconds of Summer  |  5secondsofsummer
Ed Sheeran  |  edsheeran
Khalid  |  khalid
Bebe Rexha  |  beberexha
Tyga  |  tyga
Clean Bandit  |  cleanbandit
Dennis Lloyd  |  dennislloyd
Luis Fonsi  |  luisfonsi
benny blanco  |  bennyblanco
Selena Gomez  |  selenagomez
Dynoro  |  dynoro
Eminem  |  eminem
Daddy Yankee  |  daddyyankee
Travis Scott  |  travisscott
Imagine Dragons  |  imaginedragons
Reik  |  reik
Ti?sto  |  ti?sto
Bruno Mars  |  brunomars
NF  |  nf
The Weeknd  |  theweeknd
Offset  |  offset
Sam Smith  |  samsmith
Lil Dicky  |  lildicky
6ix9ine  

In [23]:
# MAIN LOOP: Loop through both the YouTube and Spotify data sets, normalize the names (lower case, no spaces),
# and fill in the new "Artist" column in the YouTube DF with the Spotify artist value if the artist name is found

spotify_artists = spotify2018_rawdata_df['artists'].unique()

for index, row in yt_musicdata_df.iterrows():
    channel_key = row['Channel Title'].lower().replace(" ", "")

    for artist in spotify_artists:
        artist_key = artist.lower().replace(" ", "")
        if artist_key in channel_key:
            # .loc avoids chained assignment, which may write to a temporary copy
            yt_musicdata_df.loc[index, 'Artist'] = artist



In [24]:
yt_musicdata_df

Unnamed: 0,Video ID,Trending Date,Title,Artist,Channel Title,Category Titles,Views,Likes,Dislikes,Comment Count
38103,MAjY8mCTXWk,18-31-05,"周杰倫 Jay Chou【不愛我就拉倒 If You Don't Love Me, It's...",,杰威爾音樂 JVR,Music,17259071,132009,9552,14789
38082,6SuMbFuKDf8,18-31-05,Backstreet Boys - Don't Go Breaking My Heart (...,,BackstreetBoys,Music,14717193,396826,16015,39035
38085,z6A2LHGx8_A,18-31-05,Sigrid - High Five (Official Video),,Sigrid,Music,2212447,57495,1159,3008
38086,H7gh2fmdjCU,18-31-05,Calum Scott - What I Miss Most (Official Video),,CalumScott,Music,6379536,133606,1488,4460
38096,MBR2kxt7RK8,18-31-05,Taylor Swift - Delicate (Vertical Version),,TaylorSwift,Music,2188738,114913,4587,7233
38056,2in8XqiElwc,18-31-05,Nicki Minaj - Chun-Li (Live on SNL / 2018),,NickiMinajAt,Music,4455275,181363,9010,18333
38069,n_W54baizX8,18-31-05,Daddy Yankee - Hielo (Video Oficial),Daddy Yankee,Daddy Yankee,Music,28851415,581378,38243,36509
38070,pcJo0tIWybY,18-31-05,SZA - Garden (Say It Like Dat) (Official Video),,SZA,Music,4948789,201011,3815,14690
38072,7C2z4GqqS5E,18-31-05,BTS (방탄소년단) 'FAKE LOVE' Official MV,,ibighit,Music,121219886,5595203,205565,1225326
38074,7UoP9ABJXGE,18-31-05,Dan + Shay - Speechless (Wedding Video),,Dan And Shay,Music,2617227,31383,928,621


In [25]:
# Create a new filtered DF that leaves out every row with a blank "Artist".
# Every row that remains has an artist who appears in Spotify's Top 100 Songs of 2018

yt_filtered_musicdata_df = yt_musicdata_df.loc[yt_musicdata_df['Artist'] != '']
yt_filtered_musicdata_df = yt_filtered_musicdata_df.reset_index(drop=True)

In [26]:
# Set up Spotify DataFrame
spotify_2018_id = spotify2018_rawdata_df['id']
spotify_2018_name = spotify2018_rawdata_df['name']
spotify_2018_artists = spotify2018_rawdata_df['artists']

In [27]:
#type(spotify_2018_id)

In [28]:
spotify2018_filtered_df = pd.DataFrame({"Artist": spotify_2018_artists,
              "Song Name": spotify_2018_name,
              "Spotify Unique ID": spotify_2018_id
             })
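
The same three-column frame could also be built in a single step by selecting the columns and renaming them. This is only a sketch: `spotify_raw` is a hypothetical stand-in for `spotify2018_rawdata_df`, and its rows are placeholders.

```python
import pandas as pd

# Hypothetical stand-in for spotify2018_rawdata_df, with one extra column.
spotify_raw = pd.DataFrame({
    "id": ["id_1", "id_2"],
    "name": ["Song One", "Song Two"],
    "artists": ["Artist A", "Artist B"],
    "tempo": [100.0, 120.0],
})

# Select the three columns of interest, then give them readable names.
spotify_filtered = spotify_raw[["artists", "name", "id"]].rename(
    columns={"artists": "Artist", "name": "Song Name", "id": "Spotify Unique ID"}
)
```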

In [29]:
spotify2018_filtered_df.head()

Unnamed: 0,Artist,Song Name,Spotify Unique ID
0,Drake,God's Plan,6DCZcSspjsKoFjzjrWoCd
1,XXXTENTACION,SAD!,3ee8Jmje8o58CHK66QrVC
2,Post Malone,rockstar (feat. 21 Savage),0e7ipj03S05BNilyu5bRz
3,Post Malone,Psycho (feat. Ty Dolla $ign),3swc6WTsr7rl9DqQKQA55
4,Drake,In My Feelings,2G7V7zsVDxg1yRsu7Ew9R


## Step 4) Load both dataframes into MySQL database

In [30]:
# COPY THE YouTube DATA FRAME
youtube_music_artist_df = yt_filtered_musicdata_df.copy()
youtube_music_artist_df

Unnamed: 0,Video ID,Trending Date,Title,Artist,Channel Title,Category Titles,Views,Likes,Dislikes,Comment Count
0,n_W54baizX8,18-31-05,Daddy Yankee - Hielo (Video Oficial),Daddy Yankee,Daddy Yankee,Music,28851415,581378,38243,36509
1,xTlNMmZKwpA,18-31-05,"Cardi B, Bad Bunny & J Balvin - I Like It [Off...",Cardi B,Cardi B,Music,20723565,1018785,48090,68790
2,8h--kFui1JA,18-31-05,Sam Smith - Pray (Official Video) ft. Logic,Sam Smith,SamSmithWorld,Music,17424422,340361,8724,12041
3,NjMd89dggAk,18-31-05,Shawn Mendes - In My Blood (Live From The Bill...,Shawn Mendes,ShawnMendes,Music,490905,23512,1017,831
4,kFMZUxX6K6o,18-31-05,Live It Up - Nicky Jam feat. Will Smith & Era ...,Nicky Jam,NickyJamTV,Music,11413665,223896,40583,31865
5,vRf3azp1pak,18-31-05,Ariana Grande - No Tears Left To Cry (Live On ...,Ariana Grande,ArianaGrande,Music,2060402,124928,1902,6868
6,UhiXEgqhBWs,18-31-05,Lauv - Bracelet [Official Audio],Lauv,Lauv,Music,911288,33862,239,1397
7,voG07pt-KYI,18-31-05,ZAYN - Entertainer (Official Video),ZAYN,Zayn,Music,13095486,556645,14121,35488
8,aJOTlE1K90k,18-31-05,Maroon 5 - Girls Like You ft. Cardi B,Maroon 5,Maroon5,Music,3057987,406604,4520,31301
9,o-BODsf2T68,18-31-03,Lauv - Chasing Fire [Official Audio],Lauv,Lauv,Music,807622,47920,335,2041


In [31]:
# COPY THE Spotify 2018 Top 100 Filtered DATA FRAME
# create a new variable and copy the finished spotify dataframe
spotify_2018_filtered_df = spotify2018_filtered_df.copy()
spotify_2018_filtered_df.head()

Unnamed: 0,Artist,Song Name,Spotify Unique ID
0,Drake,God's Plan,6DCZcSspjsKoFjzjrWoCd
1,XXXTENTACION,SAD!,3ee8Jmje8o58CHK66QrVC
2,Post Malone,rockstar (feat. 21 Savage),0e7ipj03S05BNilyu5bRz
3,Post Malone,Psycho (feat. Ty Dolla $ign),3swc6WTsr7rl9DqQKQA55
4,Drake,In My Feelings,2G7V7zsVDxg1yRsu7Ew9R


In [32]:
# Set up "rds_connection_string" in the form "<insert user name>:<insert password>@127.0.0.1/<insert database name>"
rds_connection_string = "root:******************@127.0.0.1/youtube_spotify_2018_db"
engine = create_engine(f'mysql://{rds_connection_string}')
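
Hard-coding the password in the notebook can be avoided by reading it from the environment. A minimal sketch that only builds the URL string: the helper `build_mysql_url` and the `MYSQL_PASSWORD` variable name are assumptions for illustration, and `mysql+pymysql://` names the PyMySQL driver explicitly instead of relying on `pymysql.install_as_MySQLdb()`.

```python
import os
from urllib.parse import quote_plus

def build_mysql_url(user, password, host, database):
    """Return a SQLAlchemy-style MySQL URL that uses the PyMySQL driver."""
    # quote_plus escapes characters such as '@' or '/' that would break the URL.
    return f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"

# MYSQL_PASSWORD is a hypothetical environment variable name; the fallback is a dummy value.
password = os.environ.get("MYSQL_PASSWORD", "p@ss/word")
url = build_mysql_url("root", password, "127.0.0.1", "youtube_spotify_2018_db")
```

The resulting string could then be passed to `create_engine(url)`.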

In [None]:
# engine.set_character_set('utf8')
# engine.execute('SET NAMES utf8;')
# dbc.execute('SET CHARACTER SET utf8;')
# dbc.execute('SET character_set_connection=utf8;')

In [34]:
youtube_music_artist_df['Channel Title'].unique()
youtube_music_artist_df

Unnamed: 0,Video ID,Trending Date,Title,Artist,Channel Title,Category Titles,Views,Likes,Dislikes,Comment Count
0,n_W54baizX8,18-31-05,Daddy Yankee - Hielo (Video Oficial),Daddy Yankee,Daddy Yankee,Music,28851415,581378,38243,36509
1,xTlNMmZKwpA,18-31-05,"Cardi B, Bad Bunny & J Balvin - I Like It [Off...",Cardi B,Cardi B,Music,20723565,1018785,48090,68790
2,8h--kFui1JA,18-31-05,Sam Smith - Pray (Official Video) ft. Logic,Sam Smith,SamSmithWorld,Music,17424422,340361,8724,12041
3,NjMd89dggAk,18-31-05,Shawn Mendes - In My Blood (Live From The Bill...,Shawn Mendes,ShawnMendes,Music,490905,23512,1017,831
4,kFMZUxX6K6o,18-31-05,Live It Up - Nicky Jam feat. Will Smith & Era ...,Nicky Jam,NickyJamTV,Music,11413665,223896,40583,31865
5,vRf3azp1pak,18-31-05,Ariana Grande - No Tears Left To Cry (Live On ...,Ariana Grande,ArianaGrande,Music,2060402,124928,1902,6868
6,UhiXEgqhBWs,18-31-05,Lauv - Bracelet [Official Audio],Lauv,Lauv,Music,911288,33862,239,1397
7,voG07pt-KYI,18-31-05,ZAYN - Entertainer (Official Video),ZAYN,Zayn,Music,13095486,556645,14121,35488
8,aJOTlE1K90k,18-31-05,Maroon 5 - Girls Like You ft. Cardi B,Maroon 5,Maroon5,Music,3057987,406604,4520,31301
9,o-BODsf2T68,18-31-03,Lauv - Chasing Fire [Official Audio],Lauv,Lauv,Music,807622,47920,335,2041


In [35]:
# Clean special characters to prevent latin-1 encoding errors. Also went back up to the
# pd.read_csv call and added encoding='utf-8'.

youtube_music_artist_df['Title'] = [x.replace("é","e") for x in youtube_music_artist_df['Title']]
youtube_music_artist_df['Title'] = [x.replace("ú","u") for x in youtube_music_artist_df['Title']]
youtube_music_artist_df['Title'] = [x.replace("®","") for x in youtube_music_artist_df['Title']]
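
Replacing characters one at a time works for the three cases above, but each new accent needs another line. A more general sketch, assuming it is acceptable to drop every non-ASCII character, uses the standard library's `unicodedata`; the helper name `to_ascii` is hypothetical:

```python
import unicodedata

def to_ascii(text):
    """Decompose accented letters, then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Pérú®"))  # Peru
```

Note that this would also strip non-Latin text such as Korean or Chinese titles entirely, so it only suits data where that loss is acceptable.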

In [36]:
# Use pandas to load the CSV-derived DataFrame into the database - YouTube
youtube_music_artist_df.to_sql(name='youtube_music_2018', con=engine, if_exists='append', index=False)

In [37]:
# Use pandas to load the CSV-derived DataFrame into the database - Spotify 2018
spotify_2018_filtered_df.to_sql(name='spotify2018_top100', con=engine, if_exists='append', index=False)
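
One thing to keep in mind with `if_exists='append'` is that rerunning the load cells adds the same rows again. The sketch below illustrates the difference from `if_exists='replace'`, using an in-memory SQLite database as a stand-in for MySQL (an assumption made so the example runs without a server); the table names are hypothetical.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the MySQL database in this sketch.
conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"Artist": ["Lauv", "ZAYN"], "Views": [911288, 13095486]})

# 'append' adds the same rows again on every run.
df.to_sql("demo_append", conn, if_exists="append", index=False)
df.to_sql("demo_append", conn, if_exists="append", index=False)

# 'replace' drops and recreates the table, so reruns give the same result.
df.to_sql("demo_replace", conn, if_exists="replace", index=False)
df.to_sql("demo_replace", conn, if_exists="replace", index=False)

append_rows = conn.execute("SELECT COUNT(*) FROM demo_append").fetchone()[0]
replace_rows = conn.execute("SELECT COUNT(*) FROM demo_replace").fetchone()[0]
print(append_rows, replace_rows)  # 4 2
```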

In [43]:
# CHECK FOR TABLES
engine.table_names()

['spotify2018_top100', 'youtube_music_2018']
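
On newer SQLAlchemy releases, `engine.table_names()` may no longer be available (it was deprecated in the 1.4 series). A sketch of the inspector-based alternative follows, using an in-memory SQLite engine as a stand-in for the MySQL engine; `demo_engine` and the single `artist` column are hypothetical.

```python
from sqlalchemy import create_engine, inspect, text

# In-memory SQLite engine used as a stand-in for the MySQL engine.
demo_engine = create_engine("sqlite://")
with demo_engine.begin() as conn:
    conn.execute(text("CREATE TABLE youtube_music_2018 (artist TEXT)"))
    conn.execute(text("CREATE TABLE spotify2018_top100 (artist TEXT)"))

# The inspector reports the tables that exist in the connected database.
table_names = sorted(inspect(demo_engine).get_table_names())
print(table_names)  # ['spotify2018_top100', 'youtube_music_2018']
```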

In [39]:
# Confirm data has been added by querying the YouTube table
pd.read_sql_query('select * from youtube_music_2018', con=engine).head()

Unnamed: 0,Video ID,Trending Date,Title,Artist,Channel Title,Category Titles,Views,Likes,Dislikes,Comment Count
0,n_W54baizX8,18-31-05,Daddy Yankee - Hielo (Video Oficial),Daddy Yankee,Daddy Yankee,Music,28851415,581378,38243,36509
1,xTlNMmZKwpA,18-31-05,"Cardi B, Bad Bunny & J Balvin - I Like It [Off...",Cardi B,Cardi B,Music,20723565,1018785,48090,68790
2,8h--kFui1JA,18-31-05,Sam Smith - Pray (Official Video) ft. Logic,Sam Smith,SamSmithWorld,Music,17424422,340361,8724,12041
3,NjMd89dggAk,18-31-05,Shawn Mendes - In My Blood (Live From The Bill...,Shawn Mendes,ShawnMendes,Music,490905,23512,1017,831
4,kFMZUxX6K6o,18-31-05,Live It Up - Nicky Jam feat. Will Smith & Era ...,Nicky Jam,NickyJamTV,Music,11413665,223896,40583,31865


In [41]:
# Confirm data has been added by querying the Spotify table
pd.read_sql_query('select * from spotify2018_top100', con=engine).head()

Unnamed: 0,Artist,Song Name,Spotify Unique ID
0,Drake,God's Plan,6DCZcSspjsKoFjzjrWoCd
1,XXXTENTACION,SAD!,3ee8Jmje8o58CHK66QrVC
2,Post Malone,rockstar (feat. 21 Savage),0e7ipj03S05BNilyu5bRz
3,Post Malone,Psycho (feat. Ty Dolla $ign),3swc6WTsr7rl9DqQKQA55
4,Drake,In My Feelings,2G7V7zsVDxg1yRsu7Ew9R
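
With both tables loaded, the shared "Artist" column is what makes a join possible. The sketch below shows the shape of such a query against an in-memory SQLite database with placeholder rows (none of them come from either dataset). In MySQL, column names containing spaces would normally be quoted with backticks rather than double quotes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE youtube_music_2018 ("Title" TEXT, "Artist" TEXT)')
conn.execute('CREATE TABLE spotify2018_top100 ("Artist" TEXT, "Song Name" TEXT)')

# Placeholder rows for illustration only.
conn.executemany(
    "INSERT INTO youtube_music_2018 VALUES (?, ?)",
    [("Artist A - Video One", "Artist A"), ("Artist B - Video Two", "Artist B")],
)
conn.executemany(
    "INSERT INTO spotify2018_top100 VALUES (?, ?)",
    [("Artist A", "Song One"), ("Artist C", "Song Three")],
)

# Inner join: keep only artists present in both tables.
join_sql = """
SELECT y."Title", y."Artist", s."Song Name"
FROM youtube_music_2018 AS y
JOIN spotify2018_top100 AS s ON y."Artist" = s."Artist"
"""
joined_rows = conn.execute(join_sql).fetchall()
print(joined_rows)  # [('Artist A - Video One', 'Artist A', 'Song One')]
```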


## Finished!