
# Stacking, Sorting and Coercing

### Introduction

So far we have used queries to select data. Beyond selecting data, we may also wish to sort it and to change some of its values.

### Loading Data

In [0]:
import pandas as pd

url = "https://raw.githubusercontent.com/ledeprogram/courses/master/foundations/mapping/tilemill/yelp-lunch-nyc.csv"
lunch_df = pd.read_csv(url)


lunch_cols = lunch_df.columns 
lunch_df.head()

Unnamed: 0,Name,Address,City,Category,Rating,URL
0,Rambling House,4292 Katonah Ave,Bronx,Pubs,4.0,http://www.yelp.com/biz/rambling-house-bronx
1,Curry Spot,4268 Katonah Ave,Bronx,Indian,4.0,http://www.yelp.com/biz/curry-spot-bronx
2,Eileens Country Kitchen,964 McLean Ave,Yonkers,American (Traditional),3.5,http://www.yelp.com/biz/eileens-country-kitche...
3,Ali's Roti Shop,4220 White Plains Rd,Bronx,Trinidadian,4.0,http://www.yelp.com/biz/alis-roti-shop-bronx
4,HIM Ital Health Food Market,4374b White Plains Rd,Bronx,Health Markets,4.5,http://www.yelp.com/biz/him-ital-health-food-m...


In [0]:
lunch_np = lunch_df.to_numpy()

In [0]:
lunch_np[:2]

array([['Rambling House', '4292 Katonah Ave', 'Bronx', 'Pubs', 4.0,
        'http://www.yelp.com/biz/rambling-house-bronx'],
       ['Curry Spot', '4268 Katonah Ave', 'Bronx', 'Indian', 4.0,
        'http://www.yelp.com/biz/curry-spot-bronx']], dtype=object)

### Sorting Data

Let's say that we wish to find out which categories occur the most.  We have seen that we can get the count for each category by using `np.unique` with the option `return_counts = True`.

In [0]:
import numpy as np

unique_counts_pairs = np.unique(lunch_np[:, 3], return_counts = True)

This returns a tuple, where the first element is an array of the unique restaurant categories, and the second element is an array of the corresponding counts.

In [0]:
unique_restaurants = unique_counts_pairs[0]
unique_counts = unique_counts_pairs[1]

In [0]:
unique_restaurants[:3]

array(['Afghan', 'American (New)', 'American (Traditional)'], dtype=object)

In [0]:
unique_counts[:3]

array([  2, 259, 173])
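As a small aside, the same tuple can be unpacked in one step. The sketch below uses a made-up `toy_categories` array (not part of the Yelp data) so that it runs on its own.

```python
import numpy as np

# Hypothetical toy data standing in for the Category column.
toy_categories = np.array(
    ['Pizza', 'Thai', 'Pizza', 'Pubs', 'Pizza', 'Thai'], dtype=object
)

# np.unique returns (sorted unique values, counts) when return_counts=True,
# so tuple unpacking gives both arrays at once.
toy_labels, toy_counts = np.unique(toy_categories, return_counts=True)

print(toy_labels)   # the unique labels, in sorted order
print(toy_counts)   # the count for the label at the same position
```

The two arrays line up by position: `toy_counts[i]` is the number of times `toy_labels[i]` appears.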

Now the first thing we should do is combine these two arrays into a single numpy array.  We can do so using the `np.stack` function.

In [0]:
category_counts = np.stack([unique_restaurants, unique_counts])

> So `np.stack` joins a sequence of numpy arrays into one larger array. By default (`axis = 0`) each input array becomes a row, one on top of the other.

In [0]:
category_counts[:, 0:20]

array([['Afghan', 'American (New)', 'American (Traditional)',
        'Argentine', 'Asian Fusion', 'Bagels', 'Bakeries', 'Barbeque',
        'Bars', 'Brazilian', 'Breakfast & Brunch', 'Bubble Tea',
        'Buffets', 'Burgers', 'Cafes', 'Cajun/Creole', 'Cambodian',
        'Cantonese', 'Caribbean', 'Caterers'],
       [2, 259, 173, 1, 52, 6, 57, 50, 15, 12, 23, 3, 12, 131, 62, 5, 19,
        1, 110, 40]], dtype=object)

But we can stack them column-wise by providing `axis = 1`.

In [0]:
category_counts = np.stack([unique_restaurants, unique_counts], axis = 1)

In [0]:
category_counts[:15]

array([['Afghan', 2],
       ['American (New)', 259],
       ['American (Traditional)', 173],
       ['Argentine', 1],
       ['Asian Fusion', 52],
       ['Bagels', 6],
       ['Bakeries', 57],
       ['Barbeque', 50],
       ['Bars', 15],
       ['Brazilian', 12],
       ['Breakfast & Brunch', 23],
       ['Bubble Tea', 3],
       ['Buffets', 12],
       ['Burgers', 131],
       ['Cafes', 62]], dtype=object)
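To make the difference between the two `axis` settings concrete, here is a minimal sketch on toy arrays (the names and values are illustrative, not from the dataset). It only compares the resulting shapes.

```python
import numpy as np

# Hypothetical toy arrays; dtype=object mirrors the label array in the lesson.
toy_labels = np.array(['Pizza', 'Pubs', 'Thai'], dtype=object)
toy_counts = np.array([3, 1, 2])

# Default axis=0: each input array becomes one row.
stacked_rows = np.stack([toy_labels, toy_counts])

# axis=1: each input array becomes one column.
stacked_cols = np.stack([toy_labels, toy_counts], axis=1)

print(stacked_rows.shape)  # (2, 3): 2 rows of 3 values
print(stacked_cols.shape)  # (3, 2): 3 rows of (label, count) pairs
```

With `axis = 1`, each row pairs one label with its count, which is the layout we want for sorting later.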

### Sorting our columns

Now that we have `category_counts` indicating the category and the related number of restaurants, let's try to sort these from largest to smallest.    

1. An attempt gone wrong

For our first attempt, we can select the second column and then call `sort`.

In [0]:
category_counts[:, 1].sort()

In [0]:
category_counts[:5]

array([['Afghan', 1],
       ['American (New)', 1],
       ['American (Traditional)', 1],
       ['Argentine', 1],
       ['Asian Fusion', 1]], dtype=object)

This did **not** do what we wanted at all.  Here, it *only* changed the second column in our `category_counts` array.  It did not change the order of the first column.  So now we're in a worse spot, as our two columns no longer line up.
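The reason is that slicing a column gives a view onto the original array, and the `sort` method works in place. The sketch below, on a made-up `toy_table`, contrasts this with `np.sort`, which returns a sorted copy and leaves the original alone.

```python
import numpy as np

# Hypothetical toy table: first column is an id, second column is a count.
toy_table = np.array([[10, 3], [20, 1], [30, 2]])

# Basic slicing returns a view that shares memory with toy_table.
second_col = toy_table[:, 1]
print(np.shares_memory(second_col, toy_table))  # True

# np.sort returns a sorted copy; toy_table is untouched.
sorted_copy = np.sort(toy_table[:, 1])
table_before = toy_table.copy()

# The .sort() method sorts in place, so it rewrites the column inside
# toy_table while the first column stays where it was.
toy_table[:, 1].sort()
print(toy_table)
```

After the in-place sort, the ids and counts in `toy_table` no longer belong together, which is the same problem we hit above.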

Let's try again.

2. A second attempt

First we'll recreate `category_counts`.

In [0]:
category_counts = np.stack([unique_restaurants, unique_counts], axis = 1)
category_counts[:3]

array([['Afghan', 2],
       ['American (New)', 259],
       ['American (Traditional)', 173]], dtype=object)

Now before sorting, let's just try to find the number of times that the most popular cuisine occurs.

In [0]:
np.max(category_counts[:, 1])

661

Now to find what type of cuisine that is, we can use `np.argmax`.  The `np.argmax` function returns to us the `index` where the maximum value is located.

In [0]:
max_idx = np.argmax(category_counts[:, 1])
max_idx

71

So at position 71, we'll find our maximum value.  Now we can select this row.

In [0]:
category_counts[max_idx]

array(['Pizza', 661], dtype=object)

Ok, so here we see that the category that appears the most is `Pizza`.
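Here is the same max-then-lookup pattern as a self-contained sketch. The arrays are a small hand-picked subset of the labels and counts shown above, kept in two separate arrays for simplicity.

```python
import numpy as np

# Small illustrative subset of categories and counts.
toy_labels = np.array(
    ['Afghan', 'American (New)', 'American (Traditional)', 'Pizza', 'Asian Fusion'],
    dtype=object,
)
toy_counts = np.array([2, 259, 173, 661, 52])

# np.argmax returns the position of the largest value
# (the first such position if the maximum is tied).
top_idx = np.argmax(toy_counts)

print(top_idx)              # position of the maximum
print(toy_labels[top_idx])  # label at that position
print(toy_counts[top_idx])  # the maximum itself
```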

### Sorting For Real 

So now that we saw how to use the `np.argmax` function to return the position of maximum value, let's see what `np.argsort` does.

In [0]:
category_order = np.argsort(category_counts[:, 1])
category_order

array([ 81,   3,  60,  41,  40,  36,  35,  17,   0,  89,  74,  69,  47,
        45,  34,  24,  52,  25,  67,  39,  11,  93,  80,  68,  50,  87,
        82,  96,  28,  91,  15,   5,  59,  49, 104,  98,  62,  99,  92,
        30,   9,  12,  72, 102,  46,  48,  94,   8,  20,  38,  44,  16,
        61,  42,  37,  84,  63,  57,  73,  77,  76,  10,  33,  97,  31,
        88,  32,  83,  86,  21,  70, 100,  19,  55,  75,   7,   4,   6,
        14,  43,  26,  85,  56,  66, 103,  29,  79,  23,  18, 101,  64,
        13,  90,  95,   2,  51,  27,  78,  58,  54,   1,  65,  53,  22,
        71])

> Argsort returns an array of indices in sorted order of the provided array.

So here we have the indices according to the category counts.  But it's ordered by count, lowest to highest.  We want highest to lowest.  We can reverse the order like so.

In [0]:
category_order_desc = category_order[::-1]
category_order_desc[:10]

array([71, 22, 53, 65,  1, 54, 58, 78, 27, 51])

Ok, much better.  Now we can use these indices to order our `category_counts` array.

In [0]:
ordered_categories = category_counts[category_order_desc]
ordered_categories[:20]

array([['Pizza', 661],
       ['Chinese', 300],
       ['Italian', 285],
       ['Mexican', 259],
       ['American (New)', 259],
       ['Japanese', 247],
       ['Latin American', 233],
       ['Sandwiches', 226],
       ['Delis', 225],
       ['Indian', 181],
       ['American (Traditional)', 173],
       ['Thai', 171],
       ['Sushi Bars', 166],
       ['Burgers', 131],
       ['Mediterranean', 113],
       ['Vegetarian', 113],
       ['Caribbean', 110],
       ['Coffee & Tea', 99],
       ['Seafood', 95],
       ['Diners', 94]], dtype=object)
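One detail worth noting: reversing an ascending `argsort` also reverses the order of tied values, which may be why `Mexican` appears before `American (New)` above even though `American (New)` comes first alphabetically. The sketch below, on toy integer counts, shows the reversal next to an alternative that negates the counts instead. Passing `kind='stable'` is a choice made here so the tie order is predictable.

```python
import numpy as np

toy_counts = np.array([2, 259, 173, 1])

# Ascending order of positions, then reversed for descending order.
asc_order = np.argsort(toy_counts)
desc_order = asc_order[::-1]

# Alternative for numeric data: sort the negated values.
desc_order_alt = np.argsort(-toy_counts)

print(desc_order)      # positions from largest count to smallest
print(desc_order_alt)  # same result when there are no ties

# With ties, the two approaches differ in how tied positions are ordered.
tied_counts = np.array([5, 1, 5])
reversed_ties = np.argsort(tied_counts, kind='stable')[::-1]
negated_ties = np.argsort(-tied_counts, kind='stable')

print(reversed_ties)  # tied positions come out in reverse of their original order
print(negated_ties)   # tied positions keep their original order
```

The negation trick only works on a numeric array; reversing with `[::-1]` works for any sortable values.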

### Coercing Data

Now so far we have focused on querying data.  But we may also need to coerce our data.  For example, let's look at the cuisines that occur the least.

In [0]:
ordered_categories[::-1][:30]

array([['Senegalese', 1],
       ['Argentine', 1],
       ['Local Flavor', 1],
       ['Gastropubs', 1],
       ['Gas & Service Stations', 1],
       ['Food Delivery Services', 1],
       ['Food', 1],
       ['Cantonese', 1],
       ['Afghan', 2],
       ['Street Vendors', 2],
       ['Ramen', 2],
       ['Pakistani', 2],
       ['Hawaiian', 2],
       ['Haitian', 2],
       ['Filipino', 2],
       ['Comfort Food', 2],
       ['Irish', 2],
       ['Creperies', 3],
       ['Modern European', 3],
       ['Fruits & Veggies', 3],
       ['Bubble Tea', 3],
       ['Tapas Bars', 3],
       ['Seafood Markets', 4],
       ['Music Venues', 4],
       ['Ice Cream & Frozen Yogurt', 4],
       ['Sports Bars', 5],
       ['Shanghainese', 5],
       ['Trinidadian', 5],
       ['Desserts', 5],
       ['Szechuan', 5]], dtype=object)

We can see that `Szechuan`, `Cantonese`, and `Shanghainese` each occur only a handful of times, and could be folded into Chinese food.  The same goes for `Taiwanese`, which appears in the full list of categories below.

In [0]:
ordered_categories[:, 0]

array(['Pizza', 'Chinese', 'Italian', 'Mexican', 'American (New)',
       'Japanese', 'Latin American', 'Sandwiches', 'Delis', 'Indian',
       'American (Traditional)', 'Thai', 'Sushi Bars', 'Burgers',
       'Mediterranean', 'Vegetarian', 'Caribbean', 'Coffee & Tea',
       'Seafood', 'Diners', 'Vietnamese', 'Middle Eastern', 'Korean',
       'Spanish', 'Cuban', 'Greek', 'Cafes', 'Bakeries', 'Asian Fusion',
       'Barbeque', 'Restaurants', 'Juice Bars & Smoothies', 'Caterers',
       'Vegan', 'Peruvian', 'Chicken Wings', 'Specialty Food',
       'Soul Food', 'Falafel', 'Steakhouses', 'Ethnic Food', 'Turkish',
       'Fast Food', 'Breakfast & Brunch', 'Russian', 'Salad', 'Pubs',
       'Kosher', 'Meat Shops', 'Southern', 'Food Stands', 'German',
       'Lounges', 'Cambodian', 'Grocery', 'French', 'Cheese Shops',
       'Bars', 'Tex-Mex', 'Health Markets', 'Halal', 'Venezuelan',
       'Polish', 'Buffets', 'Brazilian', 'Dominican', 'Taiwanese',
       'Uzbek', 'Malaysian', 'Ukrainian'

Let's reassign each of the restaurants in these regional categories to the cuisine `Chinese`.  First we select the matching rows.

In [0]:
chinese_regional = ['Szechuan', 'Cantonese', 'Shanghainese', 'Taiwanese']

selected_rests = lunch_np[np.isin(lunch_np[:, 3], chinese_regional)]

Then we can assign the fourth column to be 'Chinese'.

In [0]:
selected_rests[:, 3]

array(['Cantonese', 'Szechuan', 'Shanghainese', 'Taiwanese', 'Szechuan',
       'Shanghainese', 'Taiwanese', 'Shanghainese', 'Taiwanese',
       'Taiwanese', 'Taiwanese', 'Taiwanese', 'Szechuan', 'Taiwanese',
       'Taiwanese', 'Taiwanese', 'Taiwanese', 'Szechuan', 'Shanghainese',
       'Taiwanese', 'Szechuan', 'Shanghainese'], dtype=object)

In [0]:
# Index column 3 as well, so only the category changes and not the whole row.
lunch_np[np.isin(lunch_np[:, 3], chinese_regional), 3] = 'Chinese'

In [0]:
lunch_np[np.isin(lunch_np[:, 3], chinese_regional)]

array([], shape=(0, 6), dtype=object)
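When assigning through a boolean mask, it matters whether a column index is included. The sketch below uses a tiny made-up `toy_rows` array to show a mask combined with a column index, so that only the category cell of each matching row is overwritten. Note also that selecting rows with a boolean mask produces a copy, so changing that copy would not change the original array.

```python
import numpy as np

# Hypothetical rows: name, category, rating.
toy_rows = np.array([
    ['Spot A', 'Szechuan', 4.0],
    ['Spot B', 'Pizza', 3.5],
    ['Spot C', 'Taiwanese', 4.5],
], dtype=object)

regional = ['Szechuan', 'Taiwanese']

# True for each row whose category is in the regional list.
mask = np.isin(toy_rows[:, 1], regional)

# Rows picked with a boolean mask are a copy, not a view.
picked = toy_rows[mask]
print(np.shares_memory(picked, toy_rows))  # False

# mask selects the rows, 1 selects the category column only.
toy_rows[mask, 1] = 'Chinese'
print(toy_rows)
```

Leaving out the column index (`toy_rows[mask] = 'Chinese'`) would instead write `'Chinese'` into every column of the matching rows.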

### Complex Assignment with Where

Another way of performing a find and replace is with the `np.where` function.  Here is how we can use `np.where`.

In [0]:
arr = np.arange(0, 5)

arr

array([0, 1, 2, 3, 4])

In [0]:
np.where(arr < 2, 'under 2', '2 or over')

array(['under 2', 'under 2', '2 or over', '2 or over', '2 or over'],
      dtype='<U9')

So with `np.where` we provide the condition as the first argument, then the value to use where the condition is True, and finally the value to use where the condition is False.  The result is a new array with one value per element.

In [0]:
np.where(arr < 2, 1, 0)

array([1, 1, 0, 0, 0])

Using `np.where` can be good for cleaning up data.  For example, let's say that we want to change all ratings of `3.5` to `3`.  We can do so by saying: if the rating is `3.5`, replace it with `3`; otherwise keep it the same.

In [0]:
lunch_np[:3]

array([['Rambling House', '4292 Katonah Ave', 'Bronx', 'Pubs', 4.0,
        'http://www.yelp.com/biz/rambling-house-bronx'],
       ['Curry Spot', '4268 Katonah Ave', 'Bronx', 'Indian', 4.0,
        'http://www.yelp.com/biz/curry-spot-bronx'],
       ['Eileens Country Kitchen', '964 McLean Ave', 'Yonkers',
        'American (Traditional)', 3.5,
        'http://www.yelp.com/biz/eileens-country-kitchen-yonkers']],
      dtype=object)

In [0]:
coerced_vals = np.where(lunch_np[:, 4] == 3.5, 3, lunch_np[:, 4])
set(coerced_vals)

{1.0, 1.5, 2.0, 2.5, 3, 4.0, 4.5, 5.0}

And now we can update the `lunch_np` array.

In [0]:
lunch_np[:, 4] = coerced_vals

In [0]:
set(lunch_np[:, 4])

{1.0, 1.5, 2.0, 2.5, 3, 4.0, 4.5, 5.0}
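The same replacement can be sketched on a plain float array. The `toy_ratings` values here are made up; the point is that `np.where` builds a new array and leaves its input unchanged, which is why the lesson assigns the result back into the column.

```python
import numpy as np

# Hypothetical ratings, stored as floats.
toy_ratings = np.array([4.0, 3.5, 5.0, 3.5])

# Where the rating equals 3.5 use 3.0, otherwise keep the original rating.
coerced = np.where(toy_ratings == 3.5, 3.0, toy_ratings)

print(coerced)      # new array with 3.5 replaced by 3.0
print(toy_ratings)  # the input array is unchanged
```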

### Summary

In this lesson we learned about stacking, sorting and coercing our data.  With the `np.stack` function, we join a collection of numpy arrays into a single larger array.

In [0]:
unique_counts_pairs = np.unique(lunch_np[:, 3], return_counts = True)
unique_restaurants = unique_counts_pairs[0]
unique_counts = unique_counts_pairs[1]

# use stack
category_counts = np.stack([unique_restaurants, unique_counts], axis = 1)
category_counts[:2]

array([['Afghan', 2],
       ['American (New)', 259]], dtype=object)

Then we saw that sorting the rows of a two-dimensional numpy array by one column can be tricky.  We can do so by calling `np.argsort` on that column, and then indexing the array with the resulting sorted indices.

In [0]:
sorted_cats = category_counts[np.argsort(category_counts[:, 1])][::-1]

sorted_cats[:3]

array([['Pizza', 661],
       ['Chinese', 300],
       ['Italian', 285]], dtype=object)
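Putting the pieces together, the count, sort and stack steps could be wrapped in one small helper. `ranked_counts` is a hypothetical name introduced here for illustration, and the input is toy data. This version sorts the integer counts before stacking, so the sort never has to run on an object-typed column.

```python
import numpy as np

def ranked_counts(values):
    """Return rows of (label, count), ordered from most to least frequent."""
    labels, counts = np.unique(values, return_counts=True)
    # Positions of the counts from largest to smallest.
    order = np.argsort(counts)[::-1]
    # Cast labels to object so labels and integer counts can share one array.
    return np.stack([labels[order].astype(object), counts[order]], axis=1)

# Hypothetical toy data standing in for the Category column.
toy_categories = np.array(
    ['Pizza', 'Thai', 'Pizza', 'Pubs', 'Pizza', 'Thai'], dtype=object
)

ranked = ranked_counts(toy_categories)
print(ranked)
```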

Finally, we saw how we can coerce values, by selecting and then reassigning the value.

In [0]:
chinese_regional = ['Szechuan', 'Cantonese', 'Shanghainese', 'Taiwanese']

# Index column 3 as well, so only the category changes and not the whole row.
lunch_np[np.isin(lunch_np[:, 3], chinese_regional), 3] = 'Chinese'

And then we saw that we can also find and replace using the `np.where` function.

In [0]:
arr = np.arange(0, 5)

np.where(arr < 2, 'under 2', '2 or over')

array(['under 2', 'under 2', '2 or over', '2 or over', '2 or over'],
      dtype='<U9')