### Why are there so many `None` values in the `songplays` table?

In [1]:
import os
import glob
import pandas as pd

In [2]:
# Define a function that reads every JSON file under a directory into one DataFrame.

def process_file(filepath):
    # collect the absolute path of every *.json file under filepath
    all_files = []
    for root, dirs, files in os.walk(filepath):
        files = glob.glob(os.path.join(root, '*.json'))
        for f in files:
            all_files.append(os.path.abspath(f))

    # get total number of files found
    num_files = len(all_files)
    print('{} files found in {}'.format(num_files, filepath))

    # each file holds one JSON record per line; read them all, then concatenate once
    frames = [pd.read_json(f, lines=True) for f in all_files]
    return pd.concat(frames, ignore_index=True)


In [3]:
df_song = process_file('data/song_data')  # store all the song_data in one dataframe

71 files found in data/song_data


In [4]:
df_song.head()

Unnamed: 0,num_songs,artist_id,artist_latitude,artist_longitude,artist_location,artist_name,song_id,title,duration,year
0,1,ARD7TVE1187B99BFB1,,,California - LA,Casual,SOMZWCG12A8C13C480,I Didn't Mean To,218.93179,0
1,1,ARMJAGH1187FB546F3,35.14968,-90.04892,"Memphis, TN",The Box Tops,SOCIWDW12A8C13D406,Soul Deep,148.03546,1969
2,1,ARKRRTF1187B9984DA,,,,Sonora Santanera,SOXVLOJ12AB0189215,Amor De Cabaret,177.47546,0
3,1,AR7G5I41187FB4CE6C,,,"London, England",Adam Ant,SONHOTT12A8C13493C,Something Girls,233.40363,1982
4,1,ARXR32B1187FB57099,,,,Gob,SOFSOCN12A8C143F5D,Face the Ashes,209.60608,2007


In [5]:
df_log = process_file('data/log_data')  # store all the log_data in one dataframe

30 files found in data/log_data


In [6]:
df_log.head()

Unnamed: 0,artist,auth,firstName,gender,itemInSession,lastName,length,level,location,method,page,registration,sessionId,song,status,ts,userAgent,userId
0,,Logged In,Walter,M,0,Frye,,free,"San Francisco-Oakland-Hayward, CA",GET,Home,1540919166796,38,,200,1541105830796,"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4...",39
1,,Logged In,Kaylee,F,0,Summers,,free,"Phoenix-Mesa-Scottsdale, AZ",GET,Home,1540344794796,139,,200,1541106106796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8
2,Des'ree,Logged In,Kaylee,F,1,Summers,246.30812,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1540344794796,139,You Gotta Be,200,1541106106796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8
3,,Logged In,Kaylee,F,2,Summers,,free,"Phoenix-Mesa-Scottsdale, AZ",GET,Upgrade,1540344794796,139,,200,1541106132796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8
4,Mr Oizo,Logged In,Kaylee,F,3,Summers,144.03873,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1540344794796,139,Flat 55,200,1541106352796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8


In [7]:
print(df_song.shape, df_log.shape)

(71, 10) (8056, 18)


In [8]:
song_set = set(df_song['title'])
song_log_set = set(df_log['song'])
inter = song_set.intersection(song_log_set)
print("df_song contains {} songs.".format(len(song_set)))
print("df_log contains {} songs.".format(len(song_log_set)))
print("Only {} songs in both of the two dataframes.".format(len(inter)))

df_song contains 71 songs.
df_log contains 5190 songs.
Only 2 songs in both of the two dataframes.
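
The title comparison above can be tightened a little. Rows for pages such as `Home` have no `song` value, so a set built from `df_log['song']` may also count a missing-value entry. The sketch below is one possible variant, not part of the original analysis: the helper `title_overlap` is a hypothetical name, and the toy frames contain made-up values rather than Sparkify data. It keeps only `NextSong` rows and drops missing values before comparing titles.

```python
import pandas as pd

def title_overlap(df_song, df_log):
    """Return (song titles, played titles, shared titles) as sets, ignoring missing values."""
    # Keep only log rows that represent an actual song play.
    plays = df_log[df_log['page'] == 'NextSong']
    song_titles = set(df_song['title'].dropna())
    played_titles = set(plays['song'].dropna())
    return song_titles, played_titles, song_titles & played_titles

# Toy frames with hypothetical values (not the Sparkify data).
toy_song = pd.DataFrame({'title': ['Song A', 'Song B']})
toy_log = pd.DataFrame({
    'page': ['Home', 'NextSong', 'NextSong', 'NextSong'],
    'song': [None, 'Song A', 'Song C', 'Song C'],
})

songs, played, shared = title_overlap(toy_song, toy_log)
print(len(songs), len(played), len(shared))
```

On the real data the call would be `title_overlap(df_song, df_log)`; the counts it returns may differ slightly from the ones printed above if the original sets included a missing-value entry.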


### Conclusion
Sparkify provides only two datasets:
* the song dataset, which contains 71 distinct song titles,
* the log dataset, which contains 5190 distinct song titles.

Only 2 titles appear in both. For almost every song in the log dataset, therefore, we cannot look up `song_id` and `artist_id`, which are stored only in the song dataset, so those columns are left as `None` in `songplays`. The song dataset that Sparkify provides covers only a tiny fraction of the songs that appear in the logs.
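
To illustrate how the missing song data turns into empty columns, here is a minimal sketch. It assumes that `songplays` is filled by looking up each play in the song data on title, artist name and duration; that lookup rule is an assumption, since the notebook does not show the actual ETL query. The helper `attach_song_ids` is a hypothetical name and the toy frames hold made-up values.

```python
import pandas as pd

def attach_song_ids(df_log, df_song):
    """Left-join song plays to the song data on title, artist name and duration."""
    # Keep only log rows that represent an actual song play.
    plays = df_log[df_log['page'] == 'NextSong']
    merged = plays.merge(
        df_song[['song_id', 'artist_id', 'title', 'artist_name', 'duration']],
        how='left',  # keep every play, matched or not
        left_on=['song', 'artist', 'length'],
        right_on=['title', 'artist_name', 'duration'],
    )
    return merged[['ts', 'userId', 'song', 'song_id', 'artist_id']]

# Toy frames with hypothetical values (not the Sparkify data).
toy_song = pd.DataFrame({
    'song_id': ['S1'], 'artist_id': ['A1'],
    'title': ['Song A'], 'artist_name': ['Artist A'], 'duration': [200.0],
})
toy_log = pd.DataFrame({
    'page': ['NextSong', 'NextSong', 'Home'],
    'ts': [1, 2, 3], 'userId': [10, 10, 11],
    'song': ['Song A', 'Song C', None],
    'artist': ['Artist A', 'Artist C', None],
    'length': [200.0, 180.0, None],
})

result = attach_song_ids(toy_log, toy_song)
missing = int(result['song_id'].isna().sum())
print('{} of {} plays have no song_id'.format(missing, len(result)))
```

Any play without a match keeps its row but gets a missing `song_id` and `artist_id`, which would typically be stored as `None` (NULL) once the rows are written to a database table.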