
# Stacking and Sorting Lab

### Introduction

OK, let's use our knowledge of queries and sorting to further explore the Spotify dataset.

### Loading Data

We'll load up the data, and take a look.

In [0]:
import pandas as pd
import numpy as np
url = "https://raw.githubusercontent.com/jigsawlabs-student/numpy-intro/master/top10s.csv"
df = pd.read_csv(url, encoding = "ISO-8859-1", index_col = 0)
df.head()

Unnamed: 0,title,artist,top genre,year,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop
1,"Hey, Soul Sister",Train,neo mellow,2010,97,89,67,-4,8,80,217,19,4,83
2,Love The Way You Lie,Eminem,detroit hip hop,2010,87,93,75,-5,52,64,263,24,23,82
3,TiK ToK,Kesha,dance pop,2010,120,84,76,-3,29,71,200,10,14,80
4,Bad Romance,Lady Gaga,dance pop,2010,119,92,70,-4,8,71,295,0,4,79
5,Just the Way You Are,Bruno Mars,pop,2010,109,84,64,-5,9,43,221,2,4,78


Coerce the data to numpy.

In [0]:
spotify_np = df.to_numpy()
spotify_np[:2]

array([['Hey, Soul Sister', 'Train', 'neo mellow', 2010, 97, 89, 67, -4,
        8, 80, 217, 19, 4, 83],
       ['Love The Way You Lie', 'Eminem', 'detroit hip hop', 2010, 87,
        93, 75, -5, 52, 64, 263, 24, 23, 82]], dtype=object)

And store the pandas column labels (a pandas `Index`) in a variable.

In [0]:
spotify_cols = df.columns
spotify_cols

Index(['title', 'artist', 'top genre', 'year', 'bpm', 'nrgy', 'dnce', 'dB',
       'live', 'val', 'dur', 'acous', 'spch', 'pop'],
      dtype='object')

> We can consult the [Spotify API reference here](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/) to get a sense of what these attributes mean.

### Finding Top Genres

Now we can see that the last attribute, `pop`, looks at poppiness.  Begin by ordering the tracks by poppiness score, from highest to lowest.

In [0]:
most_poppy = None

In [0]:
most_poppy[:3]

# array([['Memories', 'Maroon 5', 'pop', 2019, 91, 32, 76, -7, 8, 57, 189,
#         84, 5, 99],
#        ['Lose You To Love Me', 'Selena Gomez', 'dance pop', 2019, 102,
#         34, 51, -9, 21, 9, 206, 58, 4, 97],
#        ['Someone You Loved', 'Lewis Capaldi', 'pop', 2019, 110, 41, 50,
#         -6, 11, 45, 182, 75, 3, 96]], dtype=object)
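As a hedged illustration of the sorting idea (not the lab solution itself), the sketch below orders a small made-up array by its last column, from highest to lowest. The `tracks` array and its values are hypothetical stand-ins for `spotify_np`.

```python
import numpy as np

# Hypothetical toy data standing in for spotify_np: title, genre, score.
tracks = np.array([
    ["Song A", "pop", 70],
    ["Song B", "edm", 95],
    ["Song C", "dance pop", 82],
], dtype=object)

# argsort gives the row positions that would sort the scores in ascending order.
# The score column has dtype object, so cast it to int before sorting.
ascending_order = np.argsort(tracks[:, -1].astype(int))

# Reverse the positions to go from highest to lowest, then reindex the rows.
by_score_desc = tracks[ascending_order[::-1]]
print(by_score_desc)
```

Reversing the index array, rather than the data, keeps each row intact while changing only the row order.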

Looking at the top three songs, we see that the category is some variant of pop.  Let's get a unique list of the genres of the 25 most poppy songs.

In [0]:
unique_poppy_genres = None
unique_poppy_genres

# (array(['big room', 'boy band', 'brostep', 'canadian contemporary r&b',
#         'canadian pop', 'dance pop', 'edm', 'electropop', 'escape room',
#         'neo mellow', 'pop'], dtype=object),
#  array([1, 2, 1, 1, 1, 4, 1, 2, 2, 1, 9]))
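The tuple above has the shape that `np.unique` returns when called with `return_counts=True`. A minimal sketch on made-up genre labels (the `genres` array is hypothetical, not taken from the dataset):

```python
import numpy as np

# Hypothetical genre labels, stored as an object array like the lab's genre column.
genres = np.array(["pop", "edm", "pop", "dance pop", "pop", "edm"], dtype=object)

# np.unique returns the sorted unique values; return_counts=True adds
# a second array with how often each value appears.
unique_genres, counts = np.unique(genres, return_counts=True)
print(unique_genres)
print(counts)
```

The two arrays line up by position, so `counts[i]` is the number of occurrences of `unique_genres[i]`.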

Next, turn the tuple from `unique_poppy_genres` into an array.  The first column should have the names of the poppy genres, and the second column should have the related counts.  It should be sorted from highest to lowest by count.

> Assign it to the variable `sorted_pop_genre_counts`.  (It might take a few steps to get there.)

In [0]:
sorted_pop_genre_counts = None

In [0]:
sorted_pop_genre_counts

# array([['pop', 9],
#        ['dance pop', 4],
#        ['escape room', 2],
#        ['electropop', 2],
#        ['boy band', 2],
#        ['neo mellow', 1],
#        ['edm', 1],
#        ['canadian pop', 1],
#        ['canadian contemporary r&b', 1],
#        ['brostep', 1],
#        ['big room', 1]], dtype=object)
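One possible way to get from two parallel arrays to a sorted two-column array is sketched below, using `np.column_stack` and `np.argsort`. The `names` and `counts` arrays are made-up examples, and this is only one of several routes to the result.

```python
import numpy as np

# Hypothetical parallel arrays, shaped like the output of np.unique(..., return_counts=True).
names = np.array(["dance pop", "edm", "pop"], dtype=object)
counts = np.array([4, 1, 9])

# Place the two 1-D arrays side by side as columns; the result has dtype object
# because it mixes strings and integers.
stacked = np.column_stack((names, counts))

# Sort the row positions by count, then reverse for highest-to-lowest.
order = np.argsort(counts)[::-1]
sorted_counts = stacked[order]
print(sorted_counts)
```

Sorting on the original integer `counts` array, instead of the object column of `stacked`, avoids comparing mixed-type objects.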


So it looks like genres such as escape room and boy band might be similar to pop.

### Coercing Data

Let's see the counts of different categories in general and see if there are some with few counts that we could combine.  To do so, we can perhaps take our previous code and turn it into a function.  

Write a function called `value_counts` that takes a column of data and returns each unique value alongside the number of times it shows up, sorted from most to least frequent.

In [0]:
def value_counts(col):
    pass

> So for example, if we pass through the genre column, we should see the following:

In [0]:
genre_col = spotify_np[:, 2]

In [0]:
value_counts(genre_col)[:10]
# array([['dance pop', 327],
#        ['pop', 60],
#        ['canadian pop', 34],
#        ['barbadian pop', 15],
#        ['boy band', 15],
#        ['electropop', 13],
#        ['british soul', 11],
#        ['big room', 10],
#        ['canadian contemporary r&b', 9],
#        ['neo mellow', 9]], dtype=object)


Let's select tracks of genre `pop`, `dance pop`, `canadian pop` and `barbadian pop` and find their average poppiness score.

In [0]:

avg_poppiness = None

avg_poppiness
# 66.4916
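Selecting rows whose genre belongs to a set of values can be done with a boolean mask. The sketch below uses `np.isin` on made-up data; the `tracks` array, the `pop_like` list and the resulting number are hypothetical and unrelated to the real dataset.

```python
import numpy as np

# Hypothetical toy data: title, genre, score.
tracks = np.array([
    ["Song A", "pop", 70],
    ["Song B", "edm", 95],
    ["Song C", "dance pop", 80],
    ["Song D", "canadian pop", 60],
], dtype=object)

# Genres we want to keep (illustrative list).
pop_like = ["pop", "dance pop", "canadian pop"]

# np.isin marks each genre that appears in pop_like.
mask = np.isin(tracks[:, 1], pop_like)

# Keep the matching rows, take the score column, cast to float and average.
mean_score = tracks[mask, -1].astype(float).mean()
print(mean_score)
```

Casting to `float` before calling `mean` matters here because the slice of an object array is itself an object array.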

Next, let's replace the genre of any song that has a poppiness score above 66 with the genre "pop". If the score is not above 66, keep the genre as is.

> Use `np.where` to accomplish this.

In [0]:
# write code here
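For reference, `np.where(condition, x, y)` picks from `x` where the condition is true and from `y` elsewhere. A small sketch on made-up arrays (the `genres` and `scores` values are hypothetical):

```python
import numpy as np

# Hypothetical genre labels and scores for three tracks.
genres = np.array(["edm", "dance pop", "big room"], dtype=object)
scores = np.array([90, 50, 72])

# Where the score is above 66 use the label "pop"; otherwise keep the existing genre.
relabeled = np.where(scores > 66, "pop", genres)
print(relabeled)
```

In the lab, the result would presumably need to be assigned back to the genre column of `spotify_np` so that later cells see the change.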


And now look at the new counts, after making this change to the data.

In [0]:
genre_col = spotify_np[:, 2]
value_counts(genre_col)[:10]

# array([['pop', 362],
#        ['dance pop', 156],
#        ['canadian pop', 10],
#        ['barbadian pop', 8],
#        ['neo mellow', 7],
#        ['hip pop', 5],
#        ['boy band', 5],
#        ['art pop', 4],
#        ['atl hip hop', 4],
#        ['australian dance', 4]], dtype=object)


We can see that this condensed a large share of the tracks into the single category pop.  (This may not have been a good thing.)

### Summary

In this lesson, we practiced using `argsort` to sort our data, and stacking to combine our arrays where necessary.  We also practiced finding and replacing our data with `np.where`.