## EDA using the Million Playlist Dataset in SQLite
**Author:** Jamilla Akhund-Zade

### Introduction:
The Million Playlist Dataset (MPD) in its raw format is made up of 1,000 CSV files, numbered 0 through 999, each containing 1,000 playlists. The raw MPD was merged into a single CSV file called `combined_playlist_v2.csv`. This CSV file has an extra column denoting which CSV file each playlist came from.

The columns of the MPD are as follows:

1. **pid:** playlist ID within the original CSV
2. **pos:** position of song within the playlist
3. **artist_name:** name of the artist (string)
4. **track_uri:** unique track Spotify Identifier
5. **artist_uri:** unique artist Spotify Identifier
6. **track_name:** name of the track (string)
7. **album_uri:** unique album Spotify Identifier
8. **duration_ms:** track duration in ms
9. **album_name:** name of album
10. **file_name:** identifier of CSV where playlists came from
11. **pidfile_name:** unique playlist identifier
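
As a rough illustration of how such a combined file could be produced, the sketch below streams each per-slice CSV into one output file and adds the last two columns. It is not the code that actually built `combined_playlist_v2.csv`: the function name `combine_slices` is hypothetical, and the rules "`file_name` is the input file's name without its extension" and "`pidfile_name` is `pid` followed by `file_name`" are assumptions inferred from sample values such as `songs284` and `0songs284`.

```python
import csv
from pathlib import Path


def combine_slices(csv_paths, out_path):
    """Stream several slice CSVs into one CSV, tagging rows with their source."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = None
        for path in csv_paths:
            # assumption: the slice identifier is the file name without ".csv"
            file_name = Path(path).stem
            with open(path, newline="", encoding="utf-8") as src:
                for row in csv.DictReader(src):
                    row["file_name"] = file_name
                    # assumption: pid is only unique within a slice, so the
                    # slice identifier is appended to make it globally unique
                    row["pidfile_name"] = row["pid"] + file_name
                    if writer is None:
                        # write the header once, using the first row's columns
                        writer = csv.DictWriter(out, fieldnames=list(row))
                        writer.writeheader()
                    writer.writerow(row)
```

Because rows are copied one at a time, memory use stays small no matter how large the combined file grows.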

The MPD is 12 GB, which is too large to hold in memory, so I will create an SQLite database and run the EDA against it while the data stays on disk. I will use the SQLite command-line tools that ship with macOS:

```
> sqlite3
sqlite> -- open a brand-new database
sqlite> .open mpd.db
sqlite> CREATE TABLE mpd(
    pid INTEGER,
    pos INTEGER,
    artist_name TEXT,
    track_uri TEXT,
    artist_uri TEXT,
    track_name TEXT,
    album_uri TEXT,
    duration_ms INTEGER,
    album_name TEXT,
    file_name TEXT,
    pidfile_name TEXT
);
sqlite> -- set the import mode to CSV
sqlite> .mode csv
sqlite> -- import the CSV data into table mpd
sqlite> .import combined_playlist_v2.csv mpd
```
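
The same import can also be done from Python. The sketch below is one possible approach, not the method used in this notebook: it assumes the `mpd` table already exists with the schema shown in the `CREATE TABLE` statement, that the CSV's first row is a header containing these column names, and that a batch of 50,000 rows fits in memory. The names `COLUMNS` and `import_csv` are hypothetical.

```python
import csv
import sqlite3
from itertools import islice

COLUMNS = [
    "pid", "pos", "artist_name", "track_uri", "artist_uri", "track_name",
    "album_uri", "duration_ms", "album_name", "file_name", "pidfile_name",
]


def import_csv(csv_path, con, batch_size=50_000):
    """Insert a headered CSV into table mpd in batches; return the row count."""
    placeholders = ", ".join(f":{c}" for c in COLUMNS)
    sql = f"INSERT INTO mpd ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    total = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)  # assumption: the first row is a header
        while True:
            # read at most batch_size rows so the whole file is never in memory
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            # named placeholders pick the matching keys out of each row dict
            con.executemany(sql, batch)
            total += len(batch)
    con.commit()
    return total
```

Values arrive as strings, but columns declared `INTEGER` convert numeric text to integers on insert, so `pid`, `pos`, and `duration_ms` are still stored as numbers.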

In [1]:
# load libraries
import pandas as pd
import sqlite3

In [14]:
# test query
con = sqlite3.connect("/Users/jamillaakhund-zade/CS109A/Spotify_Project/data/mpd.db")
df = pd.read_sql_query("SELECT * FROM mpd LIMIT 3", con)  # select the first 3 rows

df.head()

| | pid | pos | artist_name | track_uri | artist_uri | track_name | album_uri | duration_ms | album_name | file_name | pidfile_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | Sleeping At Last | spotify:track:2d7LPtieXdIYzf7yHPooWd | spotify:artist:0MeLMJJcouYXCymQSHPn8g | Chasing Cars | spotify:album:0UIIvTTWNB3gRQWFoxoEDh | 242564 | Covers, Vol. 2 | songs284 | 0songs284 |
| 1 | 0 | 1 | Rachael Yamagata | spotify:track:0y4TKcc7p2H6P0GJlt01EI | spotify:artist:7w0qj2HiAPIeUcoPogvOZ6 | Elephants | spotify:album:6KzK9fDNmj7GHFbcE4gVJD | 253701 | Elephants...Teeth Sinking Into Heart | songs284 | 0songs284 |
| 2 | 0 | 2 | The Cinematic Orchestra | spotify:track:6q4c1vPRZREh7nw3wG7Ixz | spotify:artist:32ogthv0BdaSMPml02X9YB | That Home | spotify:album:5cPHT4yMCfETLRYAoBFcOZ | 103920 | Ma Fleur | songs284 | 0songs284 |
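
Since the goal is to keep the data out of memory, EDA queries work best when SQLite does the aggregation and only the summary comes back to Python. The sketch below is a hypothetical example of that pattern, counting tracks and total duration per playlist. The names `PLAYLIST_LENGTH_SQL` and `longest_playlists` are illustrative and not part of the original analysis.

```python
import sqlite3

PLAYLIST_LENGTH_SQL = """
    SELECT pidfile_name,
           COUNT(*)         AS n_tracks,
           SUM(duration_ms) AS total_ms
    FROM mpd
    GROUP BY pidfile_name
    ORDER BY n_tracks DESC, pidfile_name
    LIMIT ?
"""


def longest_playlists(con, limit=10):
    """Return (pidfile_name, n_tracks, total_ms) for the longest playlists."""
    # the GROUP BY runs inside SQLite, so at most `limit` rows reach Python
    return con.execute(PLAYLIST_LENGTH_SQL, (limit,)).fetchall()
```

The same SQL string can also be passed to `pd.read_sql_query(PLAYLIST_LENGTH_SQL, con, params=(10,))` to get the result as a DataFrame.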


In [17]:
print(df.dtypes)

pid              int64
pos              int64
artist_name     object
track_uri       object
artist_uri      object
track_name      object
album_uri       object
duration_ms      int64
album_name      object
file_name       object
pidfile_name    object
dtype: object
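
The `int64` columns correspond to the columns declared `INTEGER` in the `CREATE TABLE` statement, and the `object` columns to those declared `TEXT`. To check the declared types on the database side rather than through pandas, a small helper along these lines could be used. The name `declared_types` is hypothetical, and the sketch assumes SQLite 3.16 or newer, where `pragma_table_info` is available as a table-valued function.

```python
import sqlite3


def declared_types(con, table="mpd"):
    """Map each column name of `table` to its declared SQLite type."""
    # pragma_table_info is the table-valued form of PRAGMA table_info,
    # which lets the table name be bound as an ordinary parameter
    rows = con.execute("SELECT name, type FROM pragma_table_info(?)", (table,))
    return dict(rows.fetchall())
```

Calling `declared_types(con)` on the connection opened earlier should list the eleven columns with the types given in the schema.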
