# YouTube & Spotify Top Music Artists From 2018

## SCOPE:
### - Extracted, transformed, and loaded YouTube's Top Trending Videos from December 2017 through May 2018, keeping only videos categorized as music, and created an "Artist" column to enable joining with Spotify's Top 100 Songs of 2018. Both dataframes were loaded into MySQL.


## PURPOSE:
### - I chose this project because I'm an avid music listener and a frequent concertgoer, and I wanted to work with data I was familiar with.

### Data Sources - Kaggle
 - https://www.kaggle.com/datasnaek/youtube-new/downloads/youtube-new.zip/114
 - https://www.kaggle.com/nadintamer/top-spotify-tracks-of-2018
 


In [1]:
# Import Dependencies:
import os
import csv
import json
import simplejson
import numpy as np
import pandas as pd
from datetime import datetime
import sys
import string
from sqlalchemy import create_engine, Column, Integer, String, join
from sqlalchemy_utils import database_exists, create_database, drop_database, has_index
import pymysql

In [2]:
#rds_connection_string = "<insert user name>:<insert password>@127.0.0.1/customer_db"
rds_connection_string = "root:gREATNESS23$@127.0.0.1/" #youtube_spotify_2018_db"
engine = create_engine(f'mysql://{rds_connection_string}')
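
Hardcoding credentials in a notebook is easy to leak, and special characters in a password (such as `$` or `@`) can break a connection URL unless they are percent-encoded. The following is a minimal sketch of one way to build the same kind of URL from environment variables. The helper `build_mysql_url` and the variable names `MYSQL_USER` and `MYSQL_PASSWORD` are hypothetical and not part of the original notebook.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(user, password, host="127.0.0.1", database=""):
    """Return a mysql:// URL with the user name and password percent-encoded."""
    return f"mysql://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"


# MYSQL_USER and MYSQL_PASSWORD are assumed (hypothetical) environment variables
example_user = os.environ.get("MYSQL_USER", "root")
example_password = os.environ.get("MYSQL_PASSWORD", "")
example_url = build_mysql_url(example_user, example_password,
                              database="youtube_spotify_2018_db2")
```

The resulting string could then be passed to `create_engine` in place of the f-string above.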

In [3]:
# Use SQLAlchemy-Utils to check whether the target database already exists in MySQL:

# Can set up an input for the db_name later (optional)
db_name = 'youtube_spotify_2018_db2'

# Build the URL once from db_name so the name is not hardcoded in several places
db_url = f'mysql://{rds_connection_string}{db_name}'
db_exist = database_exists(db_url)

if db_exist:
    drop_db_y_or_n = input(f'"{db_name}" database already exists in MySQL. Do you want to drop the database? Enter exactly: "y" or "n".  ')
    if drop_db_y_or_n == 'y':
        drop_database(db_url)
        print(f"Database {db_name} was dropped")
        create_new_db = input(f"Do you want to create another database called: {db_name}?  ")
        if create_new_db == 'y':
            create_database(db_url)
            print(f"The database {db_name} was created. Next, you will need to create tables for this database.  ")
        else:
            print("No database was created. Goodbye!  ")
    else:
        print("The database exists. No action was taken. Goodbye!  ")
else:
    create_database(db_url)
    print(f"The queried database did not exist, and was created as: {db_name}.  ")



"youtube_spotify_2018_db2" database already exists in MySQL. Do you want to drop the database? Enter exactly: "y" or "n".  y
Database youtube_spotify_2018_db2 was dropped
Do you want to create another database called: youtube_spotify_2018_db2?  y
The database youtube_spotify_2018_db2 was created. Next, you will need to create tables for this database.


## Extract and Transform all of YouTube Top Trending Videos

In [None]:
# YouTube data has two parts:
#   1) Category information in JSON format
#   2) Top Trending US YouTube videos in a CSV file

# Part 1) YouTube categories are stored separately in a JSON file
yt_json_file = './resources/youtube_US_category_id.json'
yt_rawjson_df = pd.read_json(yt_json_file)

In [None]:
# Extract the category id and category titles, and set them into a list

# for i in yt_rawjson_df['items']:
#     #print(i['id'])
#     print(i['id'] + ' | ' + i['snippet']['title'])
    

category_id = [i['id'] for i in yt_rawjson_df['items']]
category_title = [i['snippet']['title'] for i in yt_rawjson_df['items']]

# Create a dataframe of the category id and title for later use
category_id_title_df = pd.DataFrame({'category_id': category_id, 'category_title': category_title})
category_id_title_df.head()
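
As a complement to the dataframe above, the same id-to-title pairs can be held in a plain dictionary for quick lookups, for example to find the id of the "Music" category. This is a minimal sketch using only the standard library. The inline `sample_json` is made-up data that mimics the `items` / `id` / `snippet.title` structure used above, and `category_lookup` and `music_ids` are illustrative names.

```python
import json

# Made-up sample that mimics the structure of the category JSON file
sample_json = """
{"items": [
  {"id": "1", "snippet": {"title": "Film & Animation"}},
  {"id": "10", "snippet": {"title": "Music"}}
]}
"""

items = json.loads(sample_json)["items"]

# Map each category id (a string in the JSON) to its title
category_lookup = {item["id"]: item["snippet"]["title"] for item in items}

# Collect the ids whose title is exactly "Music"
music_ids = [cid for cid, title in category_lookup.items() if title == "Music"]
```

Because the ids are strings in the JSON, they may need to be cast (for example with `int`) before being matched against a numeric `category_id` column from the CSV.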

In [None]:
# Part 2) is the YouTube Top US Videos in a CSV
csv_file_yt = "./resources/youtube_USvideos.csv"
yt_rawdata_df = pd.read_csv(csv_file_yt)

In [None]:
# view rows, count and datatypes
yt_rawdata_df.info()

In [None]:
# Rename Columns
yt_cleandata_df = yt_rawdata_df.rename(columns={"video_id":"Video ID", "trending_date":"Trending Date",
                                                "title":"Title", "channel_title":"Channel Title",
                                                "category_id":"Category Titles", "publish_time":"Publish Time",
                                                "tags":"Tags", "views":"Views",
                                                "likes":"Likes", "dislikes":"Dislikes", 
                                                "comment_count":"Comment Count", "thumbnail_link":"Thumbnail Link",
                                                "comments_disabled":"Comments Disabled", "ratings_disabled":"Ratings Disabled",
                                                "video_error_or_removed":"Video Error Or Removed", "description":"Description"
                                               })
yt_cleandata_df.head()

In [None]:
# Drop Rows with Missing Information
yt_cleandata_df = yt_cleandata_df.dropna(how="any")

In [None]:
# Drop Duplicates and Sort by Trending Date
# (assign the result: drop_duplicates and sort_values return new dataframes)
yt_cleandata_df = yt_cleandata_df.drop_duplicates(['Video ID', 'Trending Date', 'Title', 'Channel Title', 'Category Titles', 'Publish Time']).sort_values(by=['Trending Date'], ascending=False)
yt_cleandata_df.head()
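
One detail worth keeping in mind here: by default, `drop_duplicates` returns a new dataframe and leaves the original untouched, so the result has to be assigned for the duplicates to stay dropped in later cells. A minimal sketch with made-up rows (the values in `sample_df` are illustrative only):

```python
import pandas as pd

# Made-up rows: the first two share the same Video ID and Trending Date
sample_df = pd.DataFrame({
    "Video ID": ["a1", "a1", "b2"],
    "Trending Date": ["18.01.02", "18.01.02", "18.02.02"],
    "Views": [100, 100, 250],
})

# Without assignment the original frame still has all three rows
deduped_df = sample_df.drop_duplicates(["Video ID", "Trending Date"])

rows_before = len(sample_df)
rows_after = len(deduped_df)
```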

In [None]:
# Drop Unwanted Columns

to_drop =['Publish Time', 'Tags', 'Thumbnail Link', 'Comments Disabled', 'Ratings Disabled', 'Video Error Or Removed', 'Description']

yt_cleandata_df.drop(to_drop, inplace=True, axis=1)

In [None]:
# Replace the "." in Trending Date with "-"

yt_cleandata_df['Trending Date'] = [x.replace(".","-") for x in yt_cleandata_df['Trending Date']]
yt_cleandata_df.head()
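
Replacing the separator keeps the column as text. If real dates are needed later (for filtering or sorting by date), the strings can be parsed with `datetime.strptime`. The sketch below assumes the values are laid out as two-digit year, day, month (for example `17.14.12`); that layout is an assumption to verify against the actual CSV, and `parse_trending_date` is a hypothetical helper.

```python
from datetime import datetime


def parse_trending_date(raw):
    """Parse a trending-date string, assuming a year.day.month layout."""
    # Assumed layout: two-digit year, then day, then month
    return datetime.strptime(raw.replace(".", "-"), "%y-%d-%m").date()


example_date = parse_trending_date("17.14.12")
```

If the assumed layout is wrong for a given value, `strptime` raises a `ValueError` instead of silently producing a wrong date.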