# Method Chaining with NumPy and Intro to Pandas
To type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp25&branch=main&urlpath=tree%2Fdata271_sp25%2Flectures%2Fdata271_lec11_live.ipynb) instead.
To follow along by just executing the cells, stay in this notebook.

In [None]:
# Whenever you want to use NumPy, import it with the following code
import numpy as np

### Method chaining

In [None]:
arr2d = np.random.randint(0,20,(5,4))
arr2d

In [None]:
# start by reshaping
arr2d.reshape((2,10))

In [None]:
# chain: reshape and then take the max of each column
arr2d.reshape((2,10)).max(axis=0)

In [None]:
# chain: reshape and then take the max of each column then get the average of those
arr2d.reshape((2,10)).max(axis=0).mean()

In [None]:
# when your chains start getting long, wrap them in parentheses and put one method per line
(arr2d.reshape((2,10))
 .max(axis=0)
 .mean())

In [None]:
# reminder that this doesn't change the original array
arr2d
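
A minimal sketch of why the chain works: each NumPy method returns a new array, so the next method is simply called on that result. The fixed array `arr` and the intermediate names below are illustrative, chosen so the values are reproducible rather than random.

```python
import numpy as np

# Fixed values instead of np.random.randint so the results are reproducible
arr = np.arange(20).reshape((5, 4))

# Step by step, saving each intermediate result
reshaped = arr.reshape((2, 10))      # rows are 0..9 and 10..19
col_max = reshaped.max(axis=0)       # max of each column -> 10..19
step_result = col_max.mean()

# The same computation written as one chain
chain_result = (arr.reshape((2, 10))
                .max(axis=0)
                .mean())

print(step_result, chain_result)     # both are 14.5
print(arr.shape)                     # still (5, 4): the original is unchanged
```

The chain avoids naming intermediates you never reuse. The step-by-step version is easier to debug when one link misbehaves.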

### Example

In [9]:
import pandas as pd
import ast

recipes = pd.read_csv("recipes_time_int.csv")
print(recipes.head(3))

# convert nutrition_dict from a string to an actual dict
recipes["nutrition_dict"] = recipes["nutrition_dict"].apply(ast.literal_eval)

# get calories
#   calories are stored in the 'nutrition_dict' column under the key 'calories'
#   the key may be missing, so use dict.get(key), which returns None
#   instead of raising a KeyError
recipes["calories"] = [d.get("calories") for d in recipes["nutrition_dict"]]

# make sure the ratings and calories are numeric
recipes["rating_value"] = pd.to_numeric(recipes["rating_value"], errors="coerce")
recipes["calories"] = pd.to_numeric(recipes["calories"], errors="coerce")

# summary: mean calories + mean rating by tax1, sorted by mean rating
tax1_summary = (
    recipes
      .dropna(subset=["tax1", "rating_value", "calories"])
      .groupby("tax1")
      .agg(
          mean_calories=("calories", "mean"),
          mean_rating=("rating_value", "mean"),
          n_recipes=("recipe_id", "size")
      )
      .sort_values("mean_rating", ascending=False)
)

tax1_summary


   recipe_id                          name  \
0       6770         Pumpkin Coconut Bread   
1       6672              Strawberry Bread   
2     234425  How to Make Cronuts, Part II   

                                         description  \
0  This pumpkin coconut bread is a wonderful brea...   
1  This strawberry bread with frozen strawberries...   
2  Part I of this series showed you how to make t...   

                  date_published  rating_value  rating_count cook_time  \
0  1998-07-06T00:27:23.000-04:00           4.7          81.0     PT60M   
1  1998-05-07T15:22:52.000-04:00           4.0          49.0     PT60M   
2  2013-09-09T15:45:17.000-04:00           4.9          13.0     PT20M   

                                           nutrition prep_time total_time  \
0  {'calories': '296 kcal', 'carbohydrateContent'...     PT10M      PT70M   
1  {'calories': '551 kcal', 'carbohydrateContent'...     PT15M      PT75M   
2                                                 {}     PT60M

| tax1 | mean_calories | mean_rating | n_recipes |
|---|---|---|---|
| Ingredients | 38.352941 | 4.758824 | 17 |
| HolidaysandEvents | 632.648649 | 4.732432 | 37 |
| TrustedBrands | 450.25 | 4.725 | 12 |
| USRecipes | 82.5 | 4.7 | 2 |
| BBQandGrilling | 296.4 | 4.688571 | 35 |
| FruitsVegetablesandOtherProduce | 257.909091 | 4.618182 | 33 |
| SoupsStewsandChili | 333.664459 | 4.600442 | 453 |
| PastaandNoodles | 541.304348 | 4.591304 | 23 |
| SaladRecipes | 296.01737 | 4.577667 | 403 |
| Drinks | 265.032258 | 4.577419 | 31 |
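
If you don't have the recipes file handy, the same pipeline can be tried on a tiny table. This is only a sketch: the `toy` DataFrame and its values are made up for illustration, and only the column names mirror the example above.

```python
import ast
import pandas as pd

# Hypothetical toy data; the real recipes file is not reproduced here
toy = pd.DataFrame({
    "recipe_id": [1, 2, 3, 4],
    "tax1": ["Drinks", "Drinks", "Soups", "Soups"],
    "rating_value": [4.0, 5.0, 3.0, None],
    "nutrition_dict": ["{'calories': 100}", "{'calories': 200}",
                       "{'calories': 300}", "{}"],
})

# String -> dict, then pull out 'calories' (None when the key is missing)
toy["nutrition_dict"] = toy["nutrition_dict"].apply(ast.literal_eval)
toy["calories"] = [d.get("calories") for d in toy["nutrition_dict"]]
toy["calories"] = pd.to_numeric(toy["calories"], errors="coerce")

# Same chain as above: drop incomplete rows, group, aggregate, sort
toy_summary = (
    toy
      .dropna(subset=["tax1", "rating_value", "calories"])
      .groupby("tax1")
      .agg(
          mean_calories=("calories", "mean"),
          mean_rating=("rating_value", "mean"),
          n_recipes=("recipe_id", "size"),
      )
      .sort_values("mean_rating", ascending=False)
)
print(toy_summary)
```

Recipe 4 has no rating and no calories, so `dropna` removes it before grouping. That leaves two recipes in Drinks and one in Soups.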


## Activity

There are two arrays below:

`restaurant_items` is a 2D array:
- Row 1: appetizers
- Row 2: main dishes
- Row 3: beverages

`prices` is a 2D array of the same shape as `restaurant_items`. It contains the price of each corresponding restaurant item.

Use these to answer the following questions.

In [None]:
restaurant_items = np.array([['fries','salad','soup'],
                  ['pizza','pasta','burger'],
                  ['soda','iced tea','lemonade']])
prices = np.array([[6.50,7.25,4.75],
                  [9.50,10.25,10.75],
                  [2,2.25,3]])

1. Use NumPy methods to determine how much someone would pay for their meal if they ordered the most expensive appetizer, main dish, and beverage.

In [None]:
prices.max(axis=1).sum()

2. Use NumPy methods and slicing to create a 1D array containing all the restaurant items sorted from most expensive to least expensive.

In [None]:
restaurant_items.ravel()[prices.ravel().argsort(axis=None)[::-1]]

3. Use NumPy methods and slicing to create a 1D array containing the most expensive appetizer, the most expensive main dish, and the most expensive beverage.

In [None]:
restaurant_items[[0,1,2],prices.argmax(axis=1)]
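
One way to check the three answers above is to break each into named steps. This is a sketch, not the only solution. The names `total`, `order`, `rows`, and `best_cols` are introduced here for illustration, and `np.arange(prices.shape[0])` stands in for the hard-coded `[0,1,2]`.

```python
import numpy as np

restaurant_items = np.array([['fries', 'salad', 'soup'],
                             ['pizza', 'pasta', 'burger'],
                             ['soda', 'iced tea', 'lemonade']])
prices = np.array([[6.50, 7.25, 4.75],
                   [9.50, 10.25, 10.75],
                   [2, 2.25, 3]])

# 1. Row-wise maxima are 7.25, 10.75, 3.0, so the total is 21.0
total = prices.max(axis=1).sum()

# 2. argsort gives positions in ascending price order; [::-1] reverses them
order = prices.ravel().argsort()[::-1]
items_by_price = restaurant_items.ravel()[order]

# 3. Pair each row number with the column of that row's maximum
rows = np.arange(prices.shape[0])
best_cols = prices.argmax(axis=1)          # column index of each row's max
top_items = restaurant_items[rows, best_cols]

print(total)
print(items_by_price)
print(top_items)
```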

## Pandas

In [None]:
# Whenever we want to use pandas, import it with the following code
import pandas as pd

### Creating Pandas Series

In [None]:
# Create a series from array
arr_series = pd.Series(arr2d.ravel()[:5])
arr_series

In [None]:
# Create a series from list
lst = ['data','science','math']
lst_series = pd.Series(lst)
lst_series

In [None]:
# Create a series from tuple
tup = (2,3,5)
tup_series = pd.Series(tup)
tup_series

In [None]:
# Create a series from dict (keys become index, values become data)
dct = {"one": 1, "two": 2, "three": 3}
dct_series = pd.Series(dct)
dct_series

In [None]:
# Specify indices when creating a series
tup_series_w_index = pd.Series(tup, index = ['two','three','five'])
tup_series_w_index

In [None]:
# Specify indices after creation
tup_series.index = ['two','three','five']
tup_series

In [None]:
# Can also name your series
tup_series_w_index_name = pd.Series(tup, index = ['two','three','five'],name='nums')
tup_series_w_index_name
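
One detail worth knowing: `index=` behaves differently depending on the input. The sketch below uses two illustrative names (`picked`, `labeled`) that are not part of the lecture code. With a dict, the keys are already labels, so `index=` selects and orders them. With a tuple or list, it labels the values in order.

```python
import pandas as pd

dct = {"one": 1, "two": 2, "three": 3}

# With a dict, index= picks and orders existing labels
picked = pd.Series(dct, index=["three", "one", "four"])
print(picked)    # 'four' is not a key in dct, so its value is NaN

# With a tuple or list, index= labels the values in order
labeled = pd.Series((2, 3, 5), index=["two", "three", "five"], name="nums")
print(labeled)
```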

### Series attributes

In [None]:
arr_series.dtype

In [None]:
arr_series.shape

In [None]:
arr_series.index

In [None]:
tup_series_w_index_name.index

In [None]:
arr_series.values

In [None]:
tup_series_w_index_name.name
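
For reference, here are the same attributes collected in one cell, with a comment on what each returns. The series `s` is an illustrative stand-in for `tup_series_w_index_name`.

```python
import numpy as np
import pandas as pd

s = pd.Series((2, 3, 5), index=["two", "three", "five"], name="nums")

print(s.dtype)          # an integer dtype (exact width can depend on platform)
print(s.shape)          # (3,) a one-element tuple, since a Series is 1D
print(list(s.index))    # ['two', 'three', 'five']
print(s.values)         # the data as a NumPy array
print(s.name)           # nums
```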

### Accessing Elements

In [None]:
dct_series

In [None]:
# Access elements by their index (bracket notation)
dct_series['one']

In [None]:
# Access elements by their index (dot/attribute notation)
dct_series.one

In [None]:
# Accessing elements by their position (deprecated in recent pandas versions when the index is not integer; prefer .iloc[0])
dct_series[0]

In [None]:
# Slicing by index (inclusive stop)
dct_series['one':'three']

In [None]:
# Slicing by position (exclusive stop)
dct_series[0:2]

In [None]:
# These different ways of accessing elements can get confusing if indices are ints
new_series = pd.Series({1:1,2:2,3:3,4:4,5:5})
new_series

In [None]:
# Are we accessing by index or position here?
new_series[1]

In [None]:
# Are we slicing by index or position here?
new_series[1:3]

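With an integer index, `new_series[1]` looks up the label 1, while `new_series[1:3]` slices by position. To avoid this ambiguity, be explicit: use `.loc` for index-based access and `.iloc` for position-based access.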

In [None]:
# Access by index
new_series.loc[1]

In [None]:
# Access by position
new_series.iloc[1]

In [None]:
# Slice by index
new_series.loc[1:3]

In [None]:
# Slice by position
new_series.iloc[1:3]
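
The difference is easiest to see side by side. A small sketch using the same `new_series`, where the variable names on the left are illustrative:

```python
import pandas as pd

new_series = pd.Series({1: 1, 2: 2, 3: 3, 4: 4, 5: 5})

by_label = new_series.loc[1]            # label 1 -> value 1
by_position = new_series.iloc[1]        # second element -> value 2

label_slice = new_series.loc[1:3]       # labels 1, 2, 3 (stop included)
position_slice = new_series.iloc[1:3]   # positions 1 and 2 (stop excluded)

print(by_label, by_position)
print(list(label_slice), list(position_slice))
```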

### Pandas dataframes

In [None]:
my_dict = {'fruit':['apple','banana','orange'],
          'color':['red','yellow','orange'],
          'yum score':[5,5,5],
          'in fridge':[True, False, True],
          'number':[3,4,0]}

In [None]:
fruit_df = pd.DataFrame(my_dict)
fruit_df

In [None]:
list_of_tups = [(i,i**2,i**3) for i in range(10)]
list_of_tups

In [None]:
squares_and_cubes = pd.DataFrame(list_of_tups,columns = ['n','squared','cubed'])
squares_and_cubes

#### Dataframe attributes

In [None]:
fruit_df.dtypes

In [None]:
fruit_df.shape

In [None]:
fruit_df.index

In [None]:
fruit_df.values

In [None]:
fruit_df.columns
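
As a quick recap, the sketch below rebuilds `fruit_df` and notes what each attribute returns. The last two lines go one small step beyond the cells above and assume only standard bracket selection: pulling out a single column gives back a Series.

```python
import pandas as pd

fruit_df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'],
                         'color': ['red', 'yellow', 'orange'],
                         'yum score': [5, 5, 5],
                         'in fridge': [True, False, True],
                         'number': [3, 4, 0]})

print(fruit_df.shape)            # (3, 5): 3 rows, 5 columns
print(list(fruit_df.columns))    # the dict keys become column names
print(list(fruit_df.index))      # [0, 1, 2], the default index
print(fruit_df.dtypes)           # one dtype per column

# Selecting one column by name gives back a Series
yum = fruit_df['yum score']
print(type(yum).__name__, yum.name)
```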