## Data Analytics using pandas

In [114]:
# importing library

import pandas as pd

In [115]:
# importing music file

music = pd.read_csv('music.csv')

### Basic Filtering

In [116]:
# filter rows whose artist is from the UK

uk_artist = music[music['country'] == 'UK']
uk_artist

Unnamed: 0,artist,country,plays,genre,fans
0,The Beatles,UK,150,rock,50
1,Pink Floyd,UK,10000,rock,1500
8,Iron Maiden,UK,20000,metal,3500
9,Judas Priest,UK,5000,metal,1000


In [117]:
# another way

music[music.country == 'UK']

Unnamed: 0,artist,country,plays,genre,fans
0,The Beatles,UK,150,rock,50
1,Pink Floyd,UK,10000,rock,1500
8,Iron Maiden,UK,20000,metal,3500
9,Judas Priest,UK,5000,metal,1000


### Multiple filters

We want to filter the “music” DataFrame to select only rock artists that have 200 plays or more.

In [118]:
rock_200 = music[(music['genre'] == 'rock') & (music['plays'] >= 200)]
rock_200

Unnamed: 0,artist,country,plays,genre,fans
1,Pink Floyd,UK,10000,rock,1500
3,Cairokee,Egypt,200,rock,10
4,ACDC,US,250,rock,20
5,The Doors,US,1000,rock,80
6,Poets of The Fall,Finland,250,rock,10


In [119]:
# rock artists with more than 200 and at most 500 plays
rock_200_250 = music[(music['genre'] == 'rock') & (music['plays'] > 200) & (music['plays'] <= 500)]
rock_200_250

Unnamed: 0,artist,country,plays,genre,fans
4,ACDC,US,250,rock,20
6,Poets of The Fall,Finland,250,rock,10
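
A range condition such as the one above can also be written with `Series.between`. The following is a minimal sketch, not the notebook's original method. It assumes pandas 1.3 or later (for the string form of `inclusive`), and it rebuilds the data inline from the rows shown in the outputs as a stand-in for `music.csv`. The names `in_range` and `rock_mid` are illustrative.

```python
import pandas as pd

# Rows reconstructed from the outputs shown in this notebook (stand-in for music.csv).
music = pd.DataFrame({
    'artist': ['The Beatles', 'Pink Floyd', 'Metallica', 'Cairokee', 'ACDC',
               'The Doors', 'Poets of The Fall', 'Megadeth', 'Iron Maiden', 'Judas Priest'],
    'country': ['UK', 'UK', 'US', 'Egypt', 'US', 'US', 'Finland', 'US', 'UK', 'UK'],
    'plays': [150, 10000, 500, 200, 250, 1000, 250, 300, 20000, 5000],
    'genre': ['rock', 'rock', 'metal', 'rock', 'rock', 'rock', 'rock', 'metal', 'metal', 'metal'],
    'fans': [50, 1500, 50, 10, 20, 80, 10, 20, 3500, 1000],
})

# inclusive='right' keeps 200 < plays <= 500 (excludes 200, includes 500).
in_range = music['plays'].between(200, 500, inclusive='right')
rock_mid = music[(music['genre'] == 'rock') & in_range]
print(rock_mid['artist'].tolist())
```

Whether `between` reads better than two chained comparisons is a matter of taste; it mainly helps when the bounds are stored in variables.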


### Negation

Suppose you want to filter for artists outside the UK.

In [120]:
# one way

outside_uk = music[~(music['country'] == 'UK')]
outside_uk

Unnamed: 0,artist,country,plays,genre,fans
2,Metallica,US,500,metal,50
3,Cairokee,Egypt,200,rock,10
4,ACDC,US,250,rock,20
5,The Doors,US,1000,rock,80
6,Poets of The Fall,Finland,250,rock,10
7,Megadeth,US,300,metal,20


In [121]:
# another way

outside_uk = music[music['country'] != 'UK']
outside_uk

Unnamed: 0,artist,country,plays,genre,fans
2,Metallica,US,500,metal,50
3,Cairokee,Egypt,200,rock,10
4,ACDC,US,250,rock,20
5,The Doors,US,1000,rock,80
6,Poets of The Fall,Finland,250,rock,10
7,Megadeth,US,300,metal,20


## Challenge: Multiple Filters

### Problem definition
A music label wants to evaluate the success of its artists in the past month. However, it is unfair to evaluate based on play count across different countries. The music label would like to view at the same time:

Artists outside the UK who have > 100 plays<br>
Artists inside the UK who have > 200 plays

In [122]:
out = music[(~(music['country'] == 'UK') & (music['plays'] >100)) | ((music['country'] == 'UK') & (music['plays'] > 200))]
out['artist']

1           Pink Floyd
2            Metallica
3             Cairokee
4                 ACDC
5            The Doors
6    Poets of The Fall
7             Megadeth
8          Iron Maiden
9         Judas Priest
Name: artist, dtype: object
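
When the cut-off depends on the country, one alternative to a long boolean expression is to build a per-row threshold and compare once. This is a sketch under assumptions: the data is rebuilt inline from the rows shown in the outputs, and `thresholds` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd

# Rows reconstructed from the outputs shown in this notebook (stand-in for music.csv).
music = pd.DataFrame({
    'artist': ['The Beatles', 'Pink Floyd', 'Metallica', 'Cairokee', 'ACDC',
               'The Doors', 'Poets of The Fall', 'Megadeth', 'Iron Maiden', 'Judas Priest'],
    'country': ['UK', 'UK', 'US', 'Egypt', 'US', 'US', 'Finland', 'US', 'UK', 'UK'],
    'plays': [150, 10000, 500, 200, 250, 1000, 250, 300, 20000, 5000],
    'genre': ['rock', 'rock', 'metal', 'rock', 'rock', 'rock', 'rock', 'metal', 'metal', 'metal'],
    'fans': [50, 1500, 50, 10, 20, 80, 10, 20, 3500, 1000],
})

# UK rows get a threshold of 200; every other country falls back to 100.
thresholds = music['country'].map({'UK': 200}).fillna(100)
out = music[music['plays'] > thresholds]
print(out['artist'].tolist())
```

Adding another country-specific cut-off then only means adding an entry to the dictionary.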

### Filtering by a List and String Filters

In this example, you want to filter artists who originate from either the US or the UK.

In [123]:
country_list = ['US','UK']
out = music[music['country'].isin(country_list)]
out

Unnamed: 0,artist,country,plays,genre,fans
0,The Beatles,UK,150,rock,50
1,Pink Floyd,UK,10000,rock,1500
2,Metallica,US,500,metal,50
4,ACDC,US,250,rock,20
5,The Doors,US,1000,rock,80
7,Megadeth,US,300,metal,20
8,Iron Maiden,UK,20000,metal,3500
9,Judas Priest,UK,5000,metal,1000


Example: Filtering artists whose name starts with `The`

In [124]:
out = music[music['artist'].str.startswith('The')]
out

Unnamed: 0,artist,country,plays,genre,fans
0,The Beatles,UK,150,rock,50
5,The Doors,US,1000,rock,80


In [125]:
music[music['artist'].str.contains('Met')]

Unnamed: 0,artist,country,plays,genre,fans
2,Metallica,US,500,metal,50
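
By default `str.contains` treats its pattern as a case-sensitive regular expression. The sketch below shows a literal, case-insensitive match instead. It rebuilds the data inline from the rows shown in the outputs (an assumption standing in for `music.csv`), and `mask` is an illustrative name.

```python
import pandas as pd

# Rows reconstructed from the outputs shown in this notebook (stand-in for music.csv).
music = pd.DataFrame({
    'artist': ['The Beatles', 'Pink Floyd', 'Metallica', 'Cairokee', 'ACDC',
               'The Doors', 'Poets of The Fall', 'Megadeth', 'Iron Maiden', 'Judas Priest'],
    'country': ['UK', 'UK', 'US', 'Egypt', 'US', 'US', 'Finland', 'US', 'UK', 'UK'],
    'plays': [150, 10000, 500, 200, 250, 1000, 250, 300, 20000, 5000],
    'genre': ['rock', 'rock', 'metal', 'rock', 'rock', 'rock', 'rock', 'metal', 'metal', 'metal'],
    'fans': [50, 1500, 50, 10, 20, 80, 10, 20, 3500, 1000],
})

# case=False ignores letter case, regex=False matches the text literally,
# and na=False treats missing names as non-matches instead of NaN.
mask = music['artist'].str.contains('met', case=False, regex=False, na=False)
print(music.loc[mask, 'artist'].tolist())
```

Passing `regex=False` matters once the search text contains characters such as `.` or `(`, which would otherwise be interpreted as regex syntax.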


Your music data analyst is getting more curious; they want to know a list of all artists from the UK or Finland whose name contains the word The.

In [126]:
# the country must be in the list AND the artist name must contain 'The'
country_list = ['UK','Finland']
out = music[(music['country'].isin(country_list)) & (music['artist'].str.contains('The'))]
out

Unnamed: 0,artist,country,plays,genre,fans
0,The Beatles,UK,150,rock,50
6,Poets of The Fall,Finland,250,rock,10


### Problem definition
Your music analyst is getting more selective about the countries of origin of artists. They want to exclude artists from the UK or Finland, but still return any artist who has >= 10000 plays.

In [127]:
country_list = ['UK','Finland']
out = music[~(music['country'].isin(country_list)) | (music['plays'] >= 10000)]
list(out['artist'].values)

['Pink Floyd',
 'Metallica',
 'Cairokee',
 'ACDC',
 'The Doors',
 'Megadeth',
 'Iron Maiden']

## Grouping

### Problem definition
Your music analyst is interested in finding the sum of plays of artists by country. Can you return the total number of plays for each country?

In [129]:
# select the plays column before summing so that only plays are totalled
out = music.groupby('country')['plays'].sum().to_dict()

In [130]:
out

{'Egypt': 200, 'Finland': 250, 'UK': 35150, 'US': 2050}

### Problem definition
Your music analyst has realized that it’s unfair to view the total number of plays per country without accounting for the effect of genres. Can you return the total number of plays for each country/genre combination?

In [131]:
out = music.groupby(['country','genre']).plays.sum().to_dict()

In [132]:
out

{('Egypt', 'rock'): 200,
 ('Finland', 'rock'): 250,
 ('UK', 'metal'): 25000,
 ('UK', 'rock'): 10150,
 ('US', 'metal'): 800,
 ('US', 'rock'): 1250}
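
The tuple-keyed dictionary above can be awkward to read. One option, sketched here and not part of the original notebook, is to pivot the inner group level into columns with `unstack`. The data is rebuilt inline from the rows shown in the outputs, and `table` is an illustrative name.

```python
import pandas as pd

# Rows reconstructed from the outputs shown in this notebook (stand-in for music.csv).
music = pd.DataFrame({
    'artist': ['The Beatles', 'Pink Floyd', 'Metallica', 'Cairokee', 'ACDC',
               'The Doors', 'Poets of The Fall', 'Megadeth', 'Iron Maiden', 'Judas Priest'],
    'country': ['UK', 'UK', 'US', 'Egypt', 'US', 'US', 'Finland', 'US', 'UK', 'UK'],
    'plays': [150, 10000, 500, 200, 250, 1000, 250, 300, 20000, 5000],
    'genre': ['rock', 'rock', 'metal', 'rock', 'rock', 'rock', 'rock', 'metal', 'metal', 'metal'],
    'fans': [50, 1500, 50, 10, 20, 80, 10, 20, 3500, 1000],
})

# One row per country, one column per genre; missing combinations become 0.
table = music.groupby(['country', 'genre'])['plays'].sum().unstack(fill_value=0)
print(table)
```

Note that `fill_value=0` turns "no artists in this genre" into zero plays, which may or may not be the reading you want.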

### Problem definition
Your music analyst is trying to find which countries’ artists have a combined number of plays greater than 1000. Can you return a list of such country names?

In [155]:
out = music.groupby('country').plays.sum()
out = out[out > 1000]
list(out.index)

['UK', 'US']
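
If the artist rows are needed as well as the country names, `GroupBy.filter` keeps every row belonging to a group that passes a test. The following is a sketch with the data rebuilt inline from the rows shown in the outputs; `big` and `countries` are illustrative names.

```python
import pandas as pd

# Rows reconstructed from the outputs shown in this notebook (stand-in for music.csv).
music = pd.DataFrame({
    'artist': ['The Beatles', 'Pink Floyd', 'Metallica', 'Cairokee', 'ACDC',
               'The Doors', 'Poets of The Fall', 'Megadeth', 'Iron Maiden', 'Judas Priest'],
    'country': ['UK', 'UK', 'US', 'Egypt', 'US', 'US', 'Finland', 'US', 'UK', 'UK'],
    'plays': [150, 10000, 500, 200, 250, 1000, 250, 300, 20000, 5000],
    'genre': ['rock', 'rock', 'metal', 'rock', 'rock', 'rock', 'rock', 'metal', 'metal', 'metal'],
    'fans': [50, 1500, 50, 10, 20, 80, 10, 20, 3500, 1000],
})

# Keep all rows from countries whose combined plays exceed 1000.
big = music.groupby('country').filter(lambda g: g['plays'].sum() > 1000)
countries = sorted(big['country'].unique())
print(countries)
```

For the country names alone, the boolean mask on the summed Series is the simpler route; `filter` is mainly useful when the original rows are wanted.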

### Problem definition
Your music analyst would like to know the ratio of plays/fans (plays per fan) to see how dedicated listeners are to artists. This comes with a twist: it should be done per nation, not per artist.



In [162]:
out = music.groupby('country').apply(lambda x : x.plays.sum()/x.fans.sum()).to_dict()

In [163]:
out

{'Egypt': 20.0,
 'Finland': 25.0,
 'UK': 5.809917355371901,
 'US': 12.058823529411764}
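
The same ratio can be computed without `apply` by summing both columns per country and then dividing the two resulting columns. This is a sketch of that alternative, not a replacement for the cell above. The data is rebuilt inline from the rows shown in the outputs, and `totals` and `plays_per_fan` are illustrative names.

```python
import pandas as pd

# Rows reconstructed from the outputs shown in this notebook (stand-in for music.csv).
music = pd.DataFrame({
    'artist': ['The Beatles', 'Pink Floyd', 'Metallica', 'Cairokee', 'ACDC',
               'The Doors', 'Poets of The Fall', 'Megadeth', 'Iron Maiden', 'Judas Priest'],
    'country': ['UK', 'UK', 'US', 'Egypt', 'US', 'US', 'Finland', 'US', 'UK', 'UK'],
    'plays': [150, 10000, 500, 200, 250, 1000, 250, 300, 20000, 5000],
    'genre': ['rock', 'rock', 'metal', 'rock', 'rock', 'rock', 'rock', 'metal', 'metal', 'metal'],
    'fans': [50, 1500, 50, 10, 20, 80, 10, 20, 3500, 1000],
})

# Sum plays and fans per country, then divide column by column.
totals = music.groupby('country')[['plays', 'fans']].sum()
plays_per_fan = (totals['plays'] / totals['fans']).to_dict()
print(plays_per_fan)
```

This divides the country totals, so it is the ratio of sums and not the average of each artist's own ratio.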

### Problem definition
Your music analyst is interested in knowing multiple statistics at once, grouped by country. These are as follows:

Sum of plays<br>
Average of plays<br>
Maximum fans from all artists in the country

In [177]:
out = music.groupby('country').agg({'plays':['sum','mean'],'fans':['max']})

In [178]:
out.columns = ['_'.join(col) for col in out.columns.values]

In [182]:
out.to_dict(orient='index')

{'Egypt': {'plays_sum': 200, 'plays_mean': 200.0, 'fans_max': 10},
 'Finland': {'plays_sum': 250, 'plays_mean': 250.0, 'fans_max': 10},
 'UK': {'plays_sum': 35150, 'plays_mean': 8787.5, 'fans_max': 3500},
 'US': {'plays_sum': 2050, 'plays_mean': 512.5, 'fans_max': 80}}
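
The column-flattening step can be avoided with named aggregation, where each output column is declared as `name=(column, function)`. This is a minimal sketch assuming pandas 0.25 or later, with the data rebuilt inline from the rows shown in the outputs; `stats` and `result` are illustrative names.

```python
import pandas as pd

# Rows reconstructed from the outputs shown in this notebook (stand-in for music.csv).
music = pd.DataFrame({
    'artist': ['The Beatles', 'Pink Floyd', 'Metallica', 'Cairokee', 'ACDC',
               'The Doors', 'Poets of The Fall', 'Megadeth', 'Iron Maiden', 'Judas Priest'],
    'country': ['UK', 'UK', 'US', 'Egypt', 'US', 'US', 'Finland', 'US', 'UK', 'UK'],
    'plays': [150, 10000, 500, 200, 250, 1000, 250, 300, 20000, 5000],
    'genre': ['rock', 'rock', 'metal', 'rock', 'rock', 'rock', 'rock', 'metal', 'metal', 'metal'],
    'fans': [50, 1500, 50, 10, 20, 80, 10, 20, 3500, 1000],
})

# Each keyword becomes a flat output column: name=(source column, aggregation).
stats = music.groupby('country').agg(
    plays_sum=('plays', 'sum'),
    plays_mean=('plays', 'mean'),
    fans_max=('fans', 'max'),
)
result = stats.to_dict(orient='index')
print(result['UK'])
```

The result has flat column names from the start, so no `'_'.join(...)` pass over a MultiIndex is needed.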