# Song Challenge
## Goal
Company XYZ is a very early stage startup. They allow people to stream music from their mobile
for free. Right now, they still only have songs from the Beatles in their music collection, but they
are planning to expand soon.
They still have all their data in json files and they are interested in getting some basic info about
their users as well as building a very preliminary song recommendation model in order to
increase user engagement.
Working with json files is important. If you join a very early stage start-up, they might not have a
nice database and all data will be in jsons. Third party data are often stored in json files as well.
## Challenge Description
You are the fifth employee at company XYZ. The good news is that if the company becomes
big, you will become very rich with the stocks. The bad news is that, at such an early stage, the
data is usually very messy. All their data is stored in json files.
The company CEO asked you very specific questions:

#### 1-What are the top 3 and the bottom 3 states in terms of number of users?
#### 2-What are the top 3 and the bottom 3 states in terms of user engagement? You can choose how to mathematically define user engagement. What the CEO cares about here is in which states users are using the product a lot/very little.
#### 3-The CEO wants to send a gift to the first user who signed-up for each state. That is, the first user who signed-up from California, from Oregon, etc. Can you give him a list of those users?
#### 4-Build a function that takes as an input any of the songs in the data and returns the most likely song to be listened next. That is, if, for instance, a user is currently listening to "Eight Days A Week", which song has the highest probability of being played right after it by the same user? This is going to be v1 of a song recommendation model.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
song = pd.read_json('song.json')

In [4]:
song.head()

,id,user_id,user_state,user_sign_up_date,song_played,time_played
0,GOQMMKSQQH,122,Louisiana,2015-05-16,Hey Jude,2015-06-11 21:51:35
1,HWKKBQKNWI,3,Ohio,2015-05-01,We Can Work It Out,2015-06-06 16:49:19
2,DKQSXVNJDH,35,New Jersey,2015-05-04,Back In the U.S.S.R.,2015-06-14 02:11:29
3,HLHRIDQTUW,126,Illinois,2015-05-16,P.s. I Love You,2015-06-08 12:26:10
4,SUKJCSBCYW,6,New Jersey,2015-05-01,Sgt. Pepper's Lonely Hearts Club Band,2015-06-28 14:57:00


In [6]:
song.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 4000 non-null   object
 1   user_id            4000 non-null   int64 
 2   user_state         4000 non-null   object
 3   user_sign_up_date  4000 non-null   object
 4   song_played        4000 non-null   object
 5   time_played        4000 non-null   object
dtypes: int64(1), object(5)
memory usage: 187.6+ KB


In [7]:
song.describe()

,user_id
count,4000.0
mean,101.574
std,58.766835
min,1.0
25%,48.0
50%,102.0
75%,155.0
max,200.0


In [9]:
for col in song.columns:
    print(song[col].value_counts())

id
RFNSPZLYYC    1
PGKFYSWGYC    1
GZWDHMNTML    1
EHKNWVHOWT    1
VLKFPMHZUG    1
             ..
SUKJCSBCYW    1
HLHRIDQTUW    1
DKQSXVNJDH    1
HWKKBQKNWI    1
GOQMMKSQQH    1
Name: count, Length: 4000, dtype: int64
user_id
42     52
98     45
49     44
193    43
55     42
       ..
8       2
50      2
129     1
63      1
21      1
Name: count, Length: 196, dtype: int64
user_state
New York          469
California        425
Texas             230
Ohio              209
Florida           180
Pennsylvania      179
North Carolina    154
Illinois          149
Georgia           135
Missouri          127
New Jersey        117
Maryland          112
Louisiana         105
Alabama           104
Tennessee         102
Wisconsin          95
Massachusetts      91
Mississippi        85
South Carolina     85
Michigan           80
Kentucky           78
Oregon             62
Alaska             58
Indiana            55
Colorado           54
Oklahoma           49
Minnesota          42
Washington         

In [10]:
song['time_played'] = pd.to_datetime(song['time_played'])
song['user_sign_up_date'] = pd.to_datetime(song['user_sign_up_date'])

First question:
1-What are the top 3 and the bottom 3 states in terms of number of users?

In [11]:
users_per_state = song.groupby('user_state')['user_id'].nunique().sort_values(ascending=False)

In [12]:
print("TOP 3 STATES :")
top_3 = users_per_state.head(3)
for i, (state, count) in enumerate(top_3.items(), 1):
    print(f"{i}. {state}: {count} users")

TOP 3 STATES :
1. New York: 23 users
2. California: 21 users
3. Texas: 15 users


In [13]:
print("BOTTOM 3 STATES :")
bottom_3 = users_per_state.tail(3)
for i, (state, count) in enumerate(reversed(list(bottom_3.items())), 1):
    print(f"{i}. {state}: {count} users")

BOTTOM 3 STATES :
1. North Dakota: 1 users
2. Rhode Island: 1 users
3. Nebraska: 1 users
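
Several states have a single user (Kansas also shows 1 user in the engagement output further down), so which three appear in the bottom list depends on sort order rather than on a real difference. A minimal sketch, assuming a Series shaped like `users_per_state` and rebuilt here on made-up toy data so the block runs on its own, uses `nsmallest` with `keep="all"` to list every state tied at the cutoff:

```python
import pandas as pd

# Toy stand-in for the real plays table (values are made up for illustration).
toy = pd.DataFrame({
    "user_state": ["New York", "New York", "Texas", "Texas",
                   "Kansas", "Nebraska", "Rhode Island", "North Dakota"],
    "user_id": [1, 2, 3, 4, 5, 6, 7, 8],
})
toy_users_per_state = toy.groupby("user_state")["user_id"].nunique()

# keep="all" keeps every state tied with the 3rd-smallest count,
# so the result may contain more than 3 rows.
bottom_with_ties = toy_users_per_state.nsmallest(3, keep="all")
print(bottom_with_ties)
```

On the toy data this returns all four single-user states instead of an arbitrary three.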


Second question:
2-What are the top 3 and the bottom 3 states in terms of user engagement? You can choose how to mathematically define user engagement. What the CEO cares about here is in which states users are using the product a lot/very little.

In [14]:
engagement_by_state = song.groupby('user_state').agg({
    'user_id': 'nunique',  # number of users
    'song_played': 'count'  # total songs played
}).reset_index()

In [15]:
engagement_by_state['avg_songs_per_user'] = engagement_by_state['song_played'] / engagement_by_state['user_id']

In [16]:
engagement_sorted = engagement_by_state.sort_values('avg_songs_per_user', ascending=False)

In [17]:
print("TOP 3 STATES :")
top_3_engagement = engagement_sorted.head(3)
for i, row in enumerate(top_3_engagement.itertuples(), 1):
    print(f"{i}. {row.user_state}: {row.avg_songs_per_user:.1f} songs per user ({row.user_id} users, {row.song_played} total plays)")

TOP 3 STATES :
1. Nebraska: 36.0 songs per user (1 users, 36 total plays)
2. Alaska: 29.0 songs per user (2 users, 58 total plays)
3. Mississippi: 28.3 songs per user (3 users, 85 total plays)


In [18]:
print("BOTTOM 3 STATES :")
bottom_3_engagement = engagement_sorted.tail(3)
for i, row in enumerate(reversed(list(bottom_3_engagement.itertuples())), 1):
    print(f"{i}. {row.user_state}: {row.avg_songs_per_user:.1f} songs per user ({row.user_id} users, {row.song_played} total plays)")

BOTTOM 3 STATES :
1. Kansas: 8.0 songs per user (1 users, 8 total plays)
2. Virginia: 8.5 songs per user (2 users, 17 total plays)
3. Minnesota: 10.5 songs per user (4 users, 42 total plays)
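
Average plays per user is only one possible definition. It favors users who signed up earlier, and it is noisy for states with one or two users (Nebraska tops the list with a single user). As a hedged alternative, the sketch below normalizes each user's plays by days since sign-up and then averages per state. The toy data, the `plays_per_day` name, and the choice of the last observed play as the end of the observation window are all assumptions made for illustration.

```python
import pandas as pd

# Toy data in the same shape as the plays table (values are made up).
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "user_state": ["Ohio", "Ohio", "Ohio", "Ohio", "Utah", "Utah"],
    "user_sign_up_date": pd.to_datetime(
        ["2015-05-01"] * 3 + ["2015-05-10"] + ["2015-05-20"] * 2),
    "time_played": pd.to_datetime([
        "2015-06-01 10:00", "2015-06-02 11:00", "2015-06-03 12:00",
        "2015-06-05 09:00", "2015-06-10 08:00", "2015-06-29 20:00"]),
})

# Assumption: the observation window ends on the day of the last recorded play.
window_end = toy["time_played"].max().normalize()

# One row per user: number of plays and sign-up date.
per_user = (toy.groupby(["user_state", "user_id"])
               .agg(plays=("time_played", "count"),
                    signup=("user_sign_up_date", "first"))
               .reset_index())

# Days on the platform, counting the sign-up day itself.
per_user["days_on_platform"] = (window_end - per_user["signup"]).dt.days + 1
per_user["plays_per_day"] = per_user["plays"] / per_user["days_on_platform"]

# State-level engagement: mean plays per user per day.
state_engagement = (per_user.groupby("user_state")["plays_per_day"]
                            .mean()
                            .sort_values(ascending=False))
print(state_engagement)
```

In the toy data both states average 2 plays per user, but Utah ranks higher once tenure is taken into account.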


Third question:
3-The CEO wants to send a gift to the first user who signed-up for each state. That is, the first user who signed-up from California, from Oregon, etc. Can you give him a list of those users?

In [26]:
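def find_first_user(group):
    # Return the user with the earliest sign-up date within one state's rows.
    idx = group['user_sign_up_date'].idxmin()  # on ties, idxmin keeps the first row
    return group.loc[idx, ['user_id', 'user_sign_up_date']]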

In [28]:
first_users = song.groupby('user_state').apply(find_first_user, include_groups=False).sort_values(by='user_sign_up_date')
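
Sign-up dates only have day resolution, so two users in the same state could share the earliest date, and `idxmin` would return just one of them. Whether that happens in the real data is not shown here. The sketch below, on made-up toy data, is one way to list every user tied for the earliest sign-up date per state without `apply`:

```python
import pandas as pd

# Toy data: users 7 and 9 both signed up in Texas on the earliest date.
toy = pd.DataFrame({
    "user_id": [7, 7, 9, 3, 5],
    "user_state": ["Texas", "Texas", "Texas", "Ohio", "Ohio"],
    "user_sign_up_date": pd.to_datetime(
        ["2015-05-01", "2015-05-01", "2015-05-01", "2015-05-01", "2015-05-03"]),
})

# One row per user (the plays table repeats users once per song played).
users = toy.drop_duplicates("user_id")[
    ["user_state", "user_id", "user_sign_up_date"]]

# Earliest sign-up date in each state, broadcast back to every user row.
earliest = users.groupby("user_state")["user_sign_up_date"].transform("min")

# Keep every user whose sign-up date equals the state's earliest date.
first_with_ties = (users[users["user_sign_up_date"] == earliest]
                   .sort_values(["user_state", "user_id"]))
print(first_with_ties)
```

With ties kept, the CEO can decide whether to send one gift per state or one per tied user.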

In [29]:
for state, row in first_users.iterrows():
    print(f"{state}: User {row['user_id']} (signed up: {row['user_sign_up_date']})")

Alabama: User 5 (signed up: 2015-05-01 00:00:00)
Texas: User 7 (signed up: 2015-05-01 00:00:00)
Oregon: User 1 (signed up: 2015-05-01 00:00:00)
Ohio: User 3 (signed up: 2015-05-01 00:00:00)
North Carolina: User 2 (signed up: 2015-05-01 00:00:00)
New Mexico: User 4 (signed up: 2015-05-01 00:00:00)
New Jersey: User 6 (signed up: 2015-05-01 00:00:00)
Pennsylvania: User 11 (signed up: 2015-05-02 00:00:00)
New York: User 19 (signed up: 2015-05-02 00:00:00)
Minnesota: User 8 (signed up: 2015-05-02 00:00:00)
Michigan: User 13 (signed up: 2015-05-02 00:00:00)
Massachusetts: User 15 (signed up: 2015-05-02 00:00:00)
Maryland: User 18 (signed up: 2015-05-02 00:00:00)
Mississippi: User 23 (signed up: 2015-05-02 00:00:00)
Georgia: User 20 (signed up: 2015-05-02 00:00:00)
Utah: User 29 (signed up: 2015-05-03 00:00:00)
Kentucky: User 34 (signed up: 2015-05-04 00:00:00)
California: User 39 (signed up: 2015-05-04 00:00:00)
Florida: User 41 (signed up: 2015-05-04 00:00:00)
Wisconsin: User 32 (signed up: