<a href="https://colab.research.google.com/github/paiml/python_for_datascience/blob/master/Lesson13_Python_For_Data_Science_Sorting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 13 Sorting


## 13.1 Sort in Python

### Understanding Sorting

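Python has two built-in ways to sort: the `sorted()` function, which accepts any iterable and returns a new list, and the `list.sort()` method, which sorts a list in place.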


#### World Food Facts Dataset

* Original Data Source:  https://www.kaggle.com/openfoodfacts/world-food-facts
* Modified Source:  https://www.kaggle.com/lwodarzek/nutrition-table-clustering/output

##### Ingest

In [0]:
import pandas as pd

In [4]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/noahgift/food/master/data/features.en.openfoodfacts.org.products.csv")
df.drop(["Unnamed: 0", "exceeded", "g_sum", "energy_100g"], axis=1, inplace=True) # drop four columns we don't need
df = df.drop(df.index[[1,11877]]) # drop two outlier rows
df.rename(index=str, columns={"reconstructed_energy": "energy_100g"}, inplace=True)
df.head()

,fat_100g,carbohydrates_100g,sugars_100g,proteins_100g,salt_100g,energy_100g,product
0,28.57,64.29,14.29,3.57,0.0,2267.85,Banana Chips Sweetened (Whole)
2,57.14,17.86,3.57,17.86,1.22428,2835.7,Organic Salted Nut Mix
3,18.75,57.81,15.62,14.06,0.1397,1953.04,Organic Muesli
4,36.67,36.67,3.33,16.67,1.60782,2336.91,Zen Party Mix
5,18.18,60.0,21.82,14.55,0.02286,1976.37,Cinnamon Nut Granola


#### Using built-in sorting

Convert the pandas DataFrame column names into a list:

In [0]:
food_facts = list(df.columns.values)
food_facts

['fat_100g',
 'carbohydrates_100g',
 'sugars_100g',
 'proteins_100g',
 'salt_100g',
 'energy_100g',
 'product']

##### Alphabetical Sort

In [0]:
sorted(food_facts)

['carbohydrates_100g',
 'energy_100g',
 'fat_100g',
 'product',
 'proteins_100g',
 'salt_100g',
 'sugars_100g']

##### Reverse Alphabetical Sort

In [0]:
sorted(food_facts, reverse=True)

['sugars_100g',
 'salt_100g',
 'proteins_100g',
 'product',
 'fat_100g',
 'energy_100g',
 'carbohydrates_100g']
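The column names above are all lowercase, so the default ordering looks purely alphabetical. With mixed-case strings, uppercase letters sort before lowercase ones. A minimal sketch, assuming a hypothetical mixed-case list of labels that is not in the dataset, showing how `key=str.lower` gives a case-insensitive sort:

```python
# Hypothetical mixed-case labels, invented for this illustration.
labels = ["salt_100g", "Product", "fat_100g", "Energy_100g"]

# Default string comparison puts uppercase letters first.
default_order = sorted(labels)

# key=str.lower compares lowercased copies, leaving the originals unchanged.
case_insensitive = sorted(labels, key=str.lower)

print(default_order)     # ['Energy_100g', 'Product', 'fat_100g', 'salt_100g']
print(case_insensitive)  # ['Energy_100g', 'fat_100g', 'Product', 'salt_100g']
```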

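##### Using the built-in list sort

`list.sort()` is a method of lists only. It sorts the list in place and returns `None`, whereas `sorted()` works on any iterable.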

In [0]:
food_facts = list(df.columns.values)
print(f"Before sort: {food_facts}")
food_facts.sort()
print(f"After sort: {food_facts}")


Before sort: ['fat_100g', 'carbohydrates_100g', 'sugars_100g', 'proteins_100g', 'salt_100g', 'energy_100g', 'product']
After sort: ['carbohydrates_100g', 'energy_100g', 'fat_100g', 'product', 'proteins_100g', 'salt_100g', 'sugars_100g']
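The difference between the two approaches can be sketched with a tuple, which has no `sort()` method. The tuple below is a shortened copy of the column names, used here only for illustration:

```python
# A tuple of column names; tuples are immutable and have no .sort() method.
columns = ("fat_100g", "carbohydrates_100g", "sugars_100g")

# sorted() accepts any iterable and always returns a new list.
new_list = sorted(columns)

# list.sort() reorders the list itself and returns None.
names = list(columns)
result = names.sort()

print(new_list)  # ['carbohydrates_100g', 'fat_100g', 'sugars_100g']
print(result)    # None
print(names)     # ['carbohydrates_100g', 'fat_100g', 'sugars_100g']
```

A common mistake is writing `names = names.sort()`, which replaces the list with `None`.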


##### Timing built-in sort function vs list sort method

Timing the `list.sort()` method. It sorts in place, so only the first loop sees unsorted input:

In [0]:
food_facts = list(df.columns.values)

In [0]:
%%timeit -n 3 -r 3
food_facts.sort()



3 loops, best of 3: 307 ns per loop


Timing the built-in `sorted()` function, which builds and returns a new list on every call:

In [0]:
food_facts = list(df.columns.values)

In [0]:
%%timeit -n 3 -r 3
sorted(food_facts)

3 loops, best of 3: 513 ns per loop
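One way to give both approaches the same unsorted input on every run is to copy the list inside the timed function. A minimal sketch using the standard-library `timeit` module; the helper names and the run counts are assumptions of this sketch, not part of the original lesson, and the timings will vary by machine:

```python
import timeit

columns = ["fat_100g", "carbohydrates_100g", "sugars_100g",
           "proteins_100g", "salt_100g", "energy_100g", "product"]

def time_list_sort():
    fresh = list(columns)  # copy so every run sorts unsorted input
    fresh.sort()
    return fresh

def time_sorted():
    fresh = list(columns)  # same copy cost, to keep the comparison fair
    return sorted(fresh)

# Take the best of 3 repeats of 1000 calls each.
sort_time = min(timeit.repeat(time_list_sort, number=1000, repeat=3))
sorted_time = min(timeit.repeat(time_sorted, number=1000, repeat=3))

print(f"list.sort(): {sort_time:.6f}s per 1000 calls")
print(f"sorted():    {sorted_time:.6f}s per 1000 calls")
```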


#### Sorting Dictionary

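Passing a dictionary to `sorted()` iterates over its keys, so the result is a sorted list of keys. First, convert one row to a dictionary: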

In [0]:
food_facts_row = df.head(1).to_dict()
food_facts_row

{'carbohydrates_100g': {'0': 64.29},
 'energy_100g': {'0': 2267.85},
 'fat_100g': {'0': 28.57},
 'product': {'0': 'Banana Chips Sweetened (Whole)'},
 'proteins_100g': {'0': 3.57},
 'salt_100g': {'0': 0.0},
 'sugars_100g': {'0': 14.29}}

Reverse-sort the dictionary keys:

In [0]:
sorted(food_facts_row, reverse=True)

['sugars_100g',
 'salt_100g',
 'proteins_100g',
 'product',
 'fat_100g',
 'energy_100g',
 'carbohydrates_100g']
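To sort by values rather than keys, one option is to sort the `(key, value)` pairs from `.items()` with a `key` function. A minimal sketch, assuming a flattened dictionary built by hand from the numeric values of the first row:

```python
# Numeric values of the first row, flattened by hand for this sketch.
nutrients = {
    "fat_100g": 28.57,
    "carbohydrates_100g": 64.29,
    "sugars_100g": 14.29,
    "proteins_100g": 3.57,
    "salt_100g": 0.0,
}

# Sort (key, value) pairs by the value, largest first.
by_value = sorted(nutrients.items(), key=lambda pair: pair[1], reverse=True)

for name, grams in by_value:
    print(f"{name}: {grams}")
```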

In [0]:
df["product"].head().values

array(['Banana Chips Sweetened (Whole)', 'Organic Salted Nut Mix',
       'Organic Muesli', 'Zen Party Mix', 'Cinnamon Nut Granola'],
      dtype=object)

#### Sorting A Generator Pipeline

In [0]:
def dataframe_rows(df=df, column="product", chunks=10):
    """Yield the values of one column in lists of `chunks` items"""

    count_row = df.shape[0]
    rows = list(df[column].values)
    for i in range(0, count_row, chunks):
        yield rows[i:i + chunks]

In [6]:
rows = dataframe_rows()
next(rows)


['Banana Chips Sweetened (Whole)',
 'Organic Salted Nut Mix',
 'Organic Muesli',
 'Zen Party Mix',
 'Cinnamon Nut Granola',
 'Organic Hazelnuts',
 'Organic Oat Groats',
 'Energy Power Mix',
 'Antioxidant Mix - Berries & Chocolate',
 'Organic Quinoa Coconut Granola With Mango']

In [7]:
next(rows)

['Fire Roasted Hatch Green Chile Almonds',
 'Peanut Butter Power Chews',
 'Organic Unswt Berry Coconut Granola',
 'Roasted Salted Black Pepper Cashews',
 'Thai Curry Roasted Cashews',
 'Wasabi Tamari Almonds',
 'Organic Red Quinoa',
 'Dark Chocolate Coconut Chews',
 'Organic Unsweetened Granola, Cinnamon Almond',
 'Organic Blueberry Almond Granola']

In [8]:
# rows has already yielded two chunks above, so this sorts the third chunk
sorted_row = (sorted(row) for row in rows)
print(next(sorted_row))

['35% Fruit And Fiber Muesli', "Aunt Ginger's Snappy Granola", 'Coconut Almond Granola', 'Dark Chocolate Sea Salt & Turbinado Almonds', 'Maple Almond Granola', 'Organic Coconut Chips', 'Organic Garbanzo Beans', 'Organic Yellow Split Peas', 'Super Nutty Granola', 'Tricolor Tortellini']
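Sorting each chunk separately does not produce one globally sorted sequence. If that is the goal, the sorted chunks can be combined with the standard-library `heapq.merge`. A minimal sketch, assuming a small hand-made list of product names and a hypothetical `chunked` helper that stands in for `dataframe_rows`:

```python
import heapq

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

products = ["Zen Party Mix", "Organic Muesli", "Banana Chips Sweetened (Whole)",
            "Cinnamon Nut Granola", "Organic Salted Nut Mix"]

# Sort each chunk lazily, then merge the already-sorted chunks.
sorted_chunks = (sorted(chunk) for chunk in chunked(products, 2))
merged = list(heapq.merge(*sorted_chunks))

print(merged)
```

Note that unpacking with `*sorted_chunks` materializes every chunk first, so this sketch trades some of the generator's laziness for a fully sorted result.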


## 13.2 Create custom sorting functions

### Building a Shuffle Function

In [0]:
food_items = ['Chocolate Nut Crunch', 'Cranberries', 'Curry Lentil Soup Mix', 
                'Milk Chocolate Peanut Butter Malt Balls', 'Organic Harvest Pilaf', 
                'Organic Tamari Pumpkin Seed', 'Split Pea Soup Mix', 
                'Swiss-Style Muesli', "Whole Wheat 'N Honey Fig Bars", 
                'Yogurt Pretzels']


In [0]:
from random import sample

def shuffle_list(items):
  """Return a shuffled copy of items; the original list is unchanged"""

  shuffled = sample(items, len(items))
  return shuffled
  

In [0]:
shuffled_food_items = shuffle_list(food_items)
shuffled_food_items

["Whole Wheat 'N Honey Fig Bars",
 'Organic Harvest Pilaf',
 'Chocolate Nut Crunch',
 'Organic Tamari Pumpkin Seed',
 'Milk Chocolate Peanut Butter Malt Balls',
 'Yogurt Pretzels',
 'Split Pea Soup Mix',
 'Swiss-Style Muesli',
 'Curry Lentil Soup Mix',
 'Cranberries']
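The shuffled order changes on every run. For a repeatable shuffle, one option is a seeded `random.Random` instance. A minimal sketch, assuming a shortened item list and an arbitrary seed value of 13, which also contrasts `sample` (returns a copy) with `shuffle` (reorders in place):

```python
import random

food_items = ["Chocolate Nut Crunch", "Cranberries",
              "Curry Lentil Soup Mix", "Yogurt Pretzels"]

# A dedicated generator; the seed 13 is an arbitrary choice for this sketch.
rng = random.Random(13)

# sample() returns a new list and leaves food_items untouched.
shuffled_copy = rng.sample(food_items, len(food_items))

# shuffle() reorders the list it is given and returns None.
in_place = list(food_items)
rng.shuffle(in_place)

print(shuffled_copy)
print(in_place)
```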

### Custom Sort Functions

#### Highly Customized Sort

In [0]:
def best_snack(item):
  """Sort key: 1 for the favorite snack, otherwise the length of the name"""
  if item == "Chocolate Nut Crunch":
    return 1
  return len(item)

sorted(shuffled_food_items, key=best_snack)

['Chocolate Nut Crunch',
 'Cranberries',
 'Yogurt Pretzels',
 'Split Pea Soup Mix',
 'Swiss-Style Muesli',
 'Organic Harvest Pilaf',
 'Curry Lentil Soup Mix',
 'Organic Tamari Pumpkin Seed',
 "Whole Wheat 'N Honey Fig Bars",
 'Milk Chocolate Peanut Butter Malt Balls']
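A key function can also return a tuple, which is compared element by element. A minimal sketch, assuming a hypothetical `favorite_first` key that is a variation on `best_snack` above: it pins the favorite first, then sorts by length, then alphabetically to break ties:

```python
snacks = ["Swiss-Style Muesli", "Yogurt Pretzels", "Chocolate Nut Crunch",
          "Split Pea Soup Mix", "Cranberries"]

def favorite_first(item, favorite="Chocolate Nut Crunch"):
    """Sort key: favorite first, then shorter names, then alphabetical."""
    # False sorts before True, so the favorite gets the smallest key.
    return (item != favorite, len(item), item)

ranked = sorted(snacks, key=favorite_first)
print(ranked)
# ['Chocolate Nut Crunch', 'Cranberries', 'Yogurt Pretzels',
#  'Split Pea Soup Mix', 'Swiss-Style Muesli']
```

The two 18-character names tie on length, so the third tuple element decides their order.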

#### Sorting Objects

In [0]:
class Food:
  def __init__(self, product, protein):
    self.product = product
    self.protein = protein
  def __repr__(self):
    return f"Food: {self.product}, Protein: {self.protein}"

In [29]:
pairs = df[["product", "proteins_100g"]].head().values.tolist()
pairs

[['Banana Chips Sweetened (Whole)', 3.57],
 ['Organic Salted Nut Mix', 17.86],
 ['Organic Muesli', 14.06],
 ['Zen Party Mix', 16.67],
 ['Cinnamon Nut Granola', 14.55]]

In [31]:
pairs = df[["product", "proteins_100g"]].head().values.tolist()
foods = [Food(item[0], item[1]) for item in pairs]
foods

[Food: Banana Chips Sweetened (Whole), Protein: 3.57,
 Food: Organic Salted Nut Mix, Protein: 17.86,
 Food: Organic Muesli, Protein: 14.06,
 Food: Zen Party Mix, Protein: 16.67,
 Food: Cinnamon Nut Granola, Protein: 14.55]

In [32]:
sorted(foods, key=lambda food: food.protein)


[Food: Banana Chips Sweetened (Whole), Protein: 3.57,
 Food: Organic Muesli, Protein: 14.06,
 Food: Cinnamon Nut Granola, Protein: 14.55,
 Food: Zen Party Mix, Protein: 16.67,
 Food: Organic Salted Nut Mix, Protein: 17.86]
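An alternative to the lambda is `operator.attrgetter` from the standard library, which builds the same kind of key function from an attribute name. A minimal sketch that redefines the `Food` class so it runs on its own and sorts from highest to lowest protein:

```python
from operator import attrgetter

class Food:
    def __init__(self, product, protein):
        self.product = product
        self.protein = protein

    def __repr__(self):
        return f"Food: {self.product}, Protein: {self.protein}"

foods = [
    Food("Banana Chips Sweetened (Whole)", 3.57),
    Food("Organic Salted Nut Mix", 17.86),
    Food("Organic Muesli", 14.06),
]

# attrgetter("protein") behaves like lambda food: food.protein.
by_protein_desc = sorted(foods, key=attrgetter("protein"), reverse=True)
print(by_protein_desc)
```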

## 13.3 Sort in pandas

### Sort by One Column:  Protein

In [0]:
df.sort_values(by=["proteins_100g"], ascending=False).head(10)

,fat_100g,carbohydrates_100g,sugars_100g,proteins_100g,salt_100g,energy_100g,product
2377,0.0,0.0,0.0,100.0,0.0,1700.0,Unflavored Gelatin
37027,0.0,0.0,0.0,100.0,0.36322,1700.0,Unflavored Gelatin
16674,6.82,22.73,13.64,86.36,14.77772,2120.51,"Fisherman's Wharf, Cocktail Shrimp"
37415,3.33,6.67,3.33,83.33,0.67818,1659.87,"Whey & Soy Protein, Flavored Drink Mix, Vanilla"
133,4.6,8.8,6.0,78.05,1.21158,1655.85,Whey Protein aus Molke 500 Gramm Vanilla
131,4.6,8.8,6.0,78.05,1.21158,1655.85,Whey Protein aus Molke 1000 Gramm Vanilla
129,4.6,8.8,6.0,78.05,1.21158,1655.85,Whey Protein aus Molke Vanilla
33115,1.67,13.33,0.0,76.67,0.0,1595.13,Vital Wheat
37392,6.25,8.33,4.17,75.0,0.635,1660.36,"Whey Protein Powder, Chocolate"
16669,5.36,21.43,14.29,67.86,12.79144,1726.97,"Fisherman's Wharf, Cocktail Shrimp"


### Sort by Two Columns:  Sugar, Salt

In [0]:
df.sort_values(by=["sugars_100g", "salt_100g"], ascending=[False, False]).head(10)

,fat_100g,carbohydrates_100g,sugars_100g,proteins_100g,salt_100g,energy_100g,product
33151,0.0,0.0,100.0,0.0,71.12,0.0,"Turkey Brine Kit, Garlic & Herb"
24783,0.0,100.0,100.0,0.0,24.13,1700.0,Seasoning
4073,0.0,100.0,100.0,0.0,7.62,1700.0,"Seasoning Rub, Sweet & Spicy Seafood"
10282,0.0,100.0,100.0,0.0,2.54,1700.0,Instant Pectin
17880,0.0,100.0,100.0,0.0,0.635,1700.0,Cranberry Cosmos Cocktail Rimming Sugar
8822,0.0,100.0,100.0,0.0,0.5588,1700.0,"Alaga, The Original Cane Flavor Syrup, Cane"
8823,0.0,100.0,100.0,0.0,0.5588,1700.0,The Original Cane Syrup
41157,0.0,100.0,100.0,0.0,0.3175,1700.0,Panela Brown Sugar Cane
41158,0.0,100.0,100.0,0.0,0.3175,1700.0,Panela Brown Sugar Cane
41159,0.0,100.0,100.0,0.0,0.3175,1700.0,Panela
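The `ascending` list pairs up with the `by` list, so each column can have its own direction. A minimal sketch, assuming a small hand-made DataFrame with a few values copied from the outputs above, that sorts sugar from high to low and breaks ties with salt from low to high:

```python
import pandas as pd

# A tiny hand-made sample; values are copied from rows shown in this lesson.
sample = pd.DataFrame({
    "product": ["Seasoning", "Instant Pectin", "Organic Muesli", "Zen Party Mix"],
    "sugars_100g": [100.0, 100.0, 15.62, 3.33],
    "salt_100g": [24.13, 2.54, 0.1397, 1.60782],
})

# Sugar descending, then salt ascending within equal sugar values.
ranked = sample.sort_values(by=["sugars_100g", "salt_100g"],
                            ascending=[False, True])

print(ranked["product"].tolist())
# ['Instant Pectin', 'Seasoning', 'Organic Muesli', 'Zen Party Mix']
```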


### Groupby

In [39]:
def high_protein(value):
  """Label a proteins_100g value as high protein (above 80) or low protein"""

  if value > 80:
    return "high_protein"
  return "low_protein"

df["high_protein"] = df["proteins_100g"].apply(high_protein)
df.head()

,fat_100g,carbohydrates_100g,sugars_100g,proteins_100g,salt_100g,energy_100g,product,high_protein
0,28.57,64.29,14.29,3.57,0.0,2267.85,Banana Chips Sweetened (Whole),low_protein
2,57.14,17.86,3.57,17.86,1.22428,2835.7,Organic Salted Nut Mix,low_protein
3,18.75,57.81,15.62,14.06,0.1397,1953.04,Organic Muesli,low_protein
4,36.67,36.67,3.33,16.67,1.60782,2336.91,Zen Party Mix,low_protein
5,18.18,60.0,21.82,14.55,0.02286,1976.37,Cinnamon Nut Granola,low_protein


In [41]:
df.groupby("high_protein").median(numeric_only=True)  # skip the text column "product"

high_protein,fat_100g,carbohydrates_100g,sugars_100g,proteins_100g,salt_100g,energy_100g
high_protein,1.665,3.335,1.665,93.18,0.5207,1700.0
low_protein,3.17,22.39,5.88,4.0,0.635,1121.54
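The same grouping can be sketched end to end on a small sample. This version assumes a hand-made DataFrame with a few values copied from the outputs above, and it builds the category with a vectorized comparison instead of `apply`, which is a swapped-in technique rather than the lesson's own method:

```python
import pandas as pd

# A tiny hand-made sample; values are copied from rows shown in this lesson.
sample = pd.DataFrame({
    "product": ["Unflavored Gelatin", "Fisherman's Wharf, Cocktail Shrimp",
                "Organic Muesli", "Zen Party Mix"],
    "proteins_100g": [100.0, 86.36, 14.06, 16.67],
    "fat_100g": [0.0, 6.82, 18.75, 36.67],
})

# Compare the whole column at once, then map True/False to labels.
sample["high_protein"] = sample["proteins_100g"].gt(80).map(
    {True: "high_protein", False: "low_protein"})

# numeric_only=True leaves the text column "product" out of the median.
medians = sample.groupby("high_protein").median(numeric_only=True)
print(medians)
```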


## Notes

Similar to the Notes section of PowerPoint (where we can exchange ideas)

* We may want Lesson 1 to be PowerPoint only