# Explore Sample Data

For this exercise we were given a sample dataset so that we could generate data to test our code. Before building the system or generating any data, we will first do a quick analysis of the sample provided, so that we know what we are working with.

In [10]:
import pandas as pd 

In [11]:
df = pd.read_csv('data/sample_listen-2021-12-01.txt', sep='|', header=None, names=['sng_id', 'user_id', 'country'])
df

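|     | sng_id | user_id | country |
| --- | --- | --- | --- |
| 0 | 29569957 | 7788196 | BL |
| 1 | 27818575 | 9642856 | GN |
| 2 | 14684680 | 8316482 | BN |
| 3 | 21133485 | 6802606 | ST |
| 4 | 11751494 | 3748041 | PW |
| ... | ... | ... | ... |
| 95 | 8458924 | 5951329 | NG |
| 96 | 1770944 | 1833529 | CL |
| 97 | 22468228 | 3060444 | CW |
| 98 | 11716006 | 8067420 | BI |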


In [12]:
print('Unique: \n', df.nunique())
print('\n')
print('Countries:\n', df['country'].unique().tolist())
print('\n')
print('General Information:')
print(df.info())
print('\n')
print('Shape: \n' ,df.shape)

Unique: 
 sng_id     100
user_id    100
country     83
dtype: int64


Countries:
 [' BL', ' GN', ' BN', ' ST', ' PW', ' HK', ' NL', ' HM', ' NR', ' GT', ' GU', ' PA', ' CW', ' GA', ' KG', ' MH', ' GI', ' WS', ' BQ', ' AE', ' IS', ' YE', ' NI', ' JO', ' LY', ' OM', ' ZM', ' BF', ' ID', ' RS', ' GL', ' IE', ' LK', ' DZ', ' UM', ' GB', ' SA', ' MO', ' HN', ' ZA', ' HU', ' DM', ' LV', ' GD', ' WF', ' PY', ' MM', ' VE', ' PH', ' KZ', ' NE', ' IQ', ' NU', ' KH', ' KN', ' MN', ' AU', ' CL', ' MV', ' CN', ' BW', ' SY', ' LR', ' MT', ' BS', ' PR', ' GQ', ' TV', ' AF', ' GF', ' SH', ' MS', ' YT', ' FI', ' BG', ' BI', ' AW', ' TO', ' PM', ' HR', ' JE', ' NG', ' TT']


General Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   sng_id   100 non-null    int64 
 1   user_id  100 non-null    int64 
 2   country  100 non-null    object
dtypes: int64(2), object(1)
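The country codes in the `unique()` output above each start with a space (`' BL'`, `' GN'`, ...). A minimal sketch of one way to clean that up, assuming the space comes from the file itself; the three rows below are made up for illustration and `demo` is a hypothetical name:

```python
import io
import pandas as pd

# Three made-up rows with a space before the country code, for illustration only.
raw = "111|901| BL\n222|902| GN\n333|903| BL\n"

demo = pd.read_csv(
    io.StringIO(raw),
    sep="|",
    header=None,
    names=["sng_id", "user_id", "country"],
)

# Remove surrounding whitespace so ' BL' and 'BL' count as the same country.
demo["country"] = demo["country"].str.strip()

print(demo["country"].unique().tolist())
```

Stripping early matters for later steps: any `groupby('country')` would otherwise treat `' BL'` and `'BL'` as two different countries if both spellings ever appear.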

> So, as we already knew, we have 100 streams and 3 features. Every user ID is unique, and so is every song ID, so we can't compute a meaningful top 50 from this sample. To check that our code works, we will probably generate 7 samples based on the one provided and test against those. We can also see that there are no missing values.
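One possible way to generate such test samples is to draw each column independently, with replacement, from the values seen in the provided sample, so that songs and users repeat. This is only a sketch under that assumption, not necessarily the approach used later; `make_day`, the row count, and the tiny stand-in `sample` are all hypothetical:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the provided sample; the values are made up.
sample = pd.DataFrame({
    "sng_id": [101, 102, 103, 104],
    "user_id": [1, 2, 3, 4],
    "country": ["HN", "BN", "HN", "CL"],
})

rng = np.random.default_rng(seed=42)  # fixed seed so runs are repeatable


def make_day(sample, n_rows, rng):
    """Build one synthetic day by sampling each column independently."""
    return pd.DataFrame({
        col: rng.choice(sample[col].to_numpy(), size=n_rows, replace=True)
        for col in sample.columns
    })


# Seven synthetic samples of 1,000 rows each (the row count is arbitrary).
days = {f"day_{i}": make_day(sample, n_rows=1_000, rng=rng) for i in range(1, 8)}

print(days["day_1"].head())
```

Because values are drawn with replacement, the same song shows up many times per country, which is what a top-N ranking needs in order to be testable.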

In [13]:
# Within the sample provided, let's analyse the top 10 countries with the most users

country_top_10 = df.groupby('country')['user_id'].size().sort_values(ascending=False)

country_top_10.head(10)

country
 HN    3
 BN    3
 CW    2
 CL    2
 GU    2
 MM    2
 GI    2
 IQ    2
 GA    2
 HK    2
Name: user_id, dtype: int64

> Here we can see that, in the sample provided, the countries with the most Deezer users are `HN` (Honduras) and `BN` (Brunei Darussalam), with 3 users each, followed by `CW` (Curaçao), `CL` (Chile), `GU` (Guam), `MM` (Myanmar), `GI` (Gibraltar), `IQ` (Iraq), `GA` (Gabon), and `HK` (Hong Kong), with 2 users each. As a quick analysis, this sample suggests that Deezer's market is focused less on Europe and more on Africa, Asia, and Central/South America. However, with only 100 streams we can't conclude much.
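With so few rows, most of the top 10 is a tie, and the cut-off at ten may fall inside a group of tied countries. A quick way to see how flat the distribution is would be to count how many countries share each stream count. The sketch below uses a made-up `countries` series as a stand-in for `df['country']`:

```python
import pandas as pd

# Made-up country column standing in for df['country'].
countries = pd.Series(
    ["HN", "HN", "HN", "BN", "BN", "CL", "GU", "IQ"], name="country"
)

# Streams per country, as in the cell above.
streams_per_country = countries.value_counts()

# How many countries have 3 streams, 2 streams, 1 stream, ...
countries_per_count = streams_per_country.value_counts().sort_index(ascending=False)

print(countries_per_count)
```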

In [30]:
# Because the goal of the exercise includes computing the top songs per country, let's try it on our sample
# We already know that we can't get a top 50 from our sample, so let's try the top 3 songs

grouped = df.groupby(['country', 'sng_id']).size().reset_index(name='streams')
sorted_grouped = grouped.sort_values(['country', 'streams'], ascending=[True, False])

sorted_grouped 

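|     | country | sng_id | streams |
| --- | --- | --- | --- |
| 0 | AE | 22400777 | 1 |
| 1 | AF | 20969661 | 1 |
| 2 | AU | 13376007 | 1 |
| 3 | AW | 17748817 | 1 |
| 4 | BF | 13957038 | 1 |
| ... | ... | ... | ... |
| 95 | YE | 24092012 | 1 |
| 96 | YT | 14765920 | 1 |
| 97 | ZA | 3355512 | 1 |
| 98 | ZA | 13127664 | 1 |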
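The cell above counts and sorts the streams but does not yet keep only the top songs per country. A minimal sketch of that last step, using `groupby(...).head(n)` on the sorted frame; the data is made up, `TOP_N` is a hypothetical name, and breaking ties by `sng_id` is an assumption rather than something the exercise specifies:

```python
import pandas as pd

# Made-up streams: song 11 is played twice in HN, every other song once.
streams = pd.DataFrame({
    "country": ["HN", "HN", "HN", "HN", "HN", "BN", "BN"],
    "sng_id":  [11,   11,   12,   13,   14,   21,   22],
})

TOP_N = 3  # the exercise asks for 50; 3 is enough for a small sample

# Count streams per (country, song) pair.
counts = (
    streams.groupby(["country", "sng_id"])
    .size()
    .reset_index(name="streams")
)

# Sort by streams within each country (ties broken by sng_id, an assumption),
# then keep the first TOP_N rows of every country.
top_songs = (
    counts.sort_values(["country", "streams", "sng_id"], ascending=[True, False, True])
    .groupby("country")
    .head(TOP_N)
    .reset_index(drop=True)
)

print(top_songs)
```

On the real sample every song has exactly one stream, so this would simply return up to three arbitrary songs per country; the ranking only becomes meaningful once songs repeat.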


In [33]:
# Songs streamed per country

num_songs = df.groupby('country')['sng_id'].count().sort_values(ascending = False).to_frame()

num_songs

| country | sng_id |
| --- | --- |
| HN | 3 |
| BN | 3 |
| CW | 2 |
| CL | 2 |
| GU | 2 |
| ... | ... |
| IE | 1 |
| ID | 1 |
| HU | 1 |
| HR | 1 |
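Note that `count()` counts rows, i.e. streams, not distinct songs. In this sample the two coincide because every song ID is unique, but on data where songs repeat they differ. A small sketch with made-up data showing both side by side; the column names `streams` and `distinct_songs` are hypothetical:

```python
import pandas as pd

# Made-up streams: song 11 is played twice in HN.
streams = pd.DataFrame({
    "country": ["HN", "HN", "HN", "BN"],
    "sng_id":  [11,   11,   12,   21],
})

per_country = streams.groupby("country")["sng_id"].agg(
    streams="count",           # number of rows, i.e. plays
    distinct_songs="nunique",  # number of different songs
)

print(per_country)
```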
