# Intro to Pandas
If you want to type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp24&branch=main&urlpath=tree%2Fdata271_sp24%2Fdemos%2Fdata271_demo12_live.ipynb) instead. 
If you don't want to type and want to follow along just by executing the cells, stay in this notebook. 

In [None]:
# Whenever you want to use NumPy, import it with the following code
import numpy as np

## Activities from last time

There are two arrays below:

`restaurant_items` is a 2D array. The first row contains appetizers, the second row contains main dishes, and the third row contains beverages. 

`prices` is a 2D array of the same shape as `restaurant_items`. It contains the price of each restaurant item. 

Use these to answer the following questions.

In [None]:
restaurant_items = np.array([['fries','salad','soup'],
                  ['pizza','pasta','burger'],
                  ['soda','iced tea','lemonade']])
prices = np.array([[6.50,7.25,4.75],
                  [9.50,10.25,10.75],
                  [2,2.25,3]])

1. Use NumPy methods to determine how much someone would pay for their meal if they ordered the most expensive appetizer, main dish, and beverage. 

In [None]:
prices.max(axis=1).sum()

2. Use NumPy methods and slicing to create a 1d array containing all the restaurant items sorted from most expensive to least expensive. 

In [None]:
restaurant_items.ravel()[prices.ravel().argsort()[::-1]]
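
The one-liner above packs several steps together. As a rough sketch, here is the same idea split into named pieces (the intermediate names `flat_items`, `flat_prices`, and `order` are made up for illustration; the arrays are repeated so the cell runs on its own):

```python
import numpy as np

restaurant_items = np.array([['fries', 'salad', 'soup'],
                             ['pizza', 'pasta', 'burger'],
                             ['soda', 'iced tea', 'lemonade']])
prices = np.array([[6.50, 7.25, 4.75],
                   [9.50, 10.25, 10.75],
                   [2, 2.25, 3]])

# flatten both arrays so position i in one matches position i in the other
flat_items = restaurant_items.ravel()
flat_prices = prices.ravel()

# argsort gives the positions that would sort the prices in ascending order;
# reversing with [::-1] turns that into most expensive first
order = flat_prices.argsort()[::-1]

# use those positions to reorder the item names
sorted_items = flat_items[order]
print(sorted_items)
```

Because `argsort` returns positions rather than values, the same `order` array can be reused to reorder any array that lines up with `prices`.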

3. Use NumPy methods and slicing to create a 1d array containing the most expensive appetizer, the most expensive main dish, and the most expensive beverage. 

In [None]:
restaurant_items[(0,1,2),prices.argmax(axis=1)]
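
The answer above relies on indexing with one sequence of row positions and one sequence of column positions, which NumPy pairs up element by element. A small sketch of what happens, with the hypothetical names `rows` and `cols` standing in for the two index sequences:

```python
import numpy as np

restaurant_items = np.array([['fries', 'salad', 'soup'],
                             ['pizza', 'pasta', 'burger'],
                             ['soda', 'iced tea', 'lemonade']])
prices = np.array([[6.50, 7.25, 4.75],
                   [9.50, 10.25, 10.75],
                   [2, 2.25, 3]])

# one row position per category: 0, 1, 2
rows = np.arange(3)

# for each row, the column position of the highest price
cols = prices.argmax(axis=1)

# pairs (0, cols[0]), (1, cols[1]), (2, cols[2]) are looked up one by one
most_expensive = restaurant_items[rows, cols]
print(cols)
print(most_expensive)
```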

## Pandas

In [None]:
# import pandas like this whenever we want to use it
import pandas as pd

### Pandas Series

In [None]:
# even numbers from 2 through 20
evens_array = np.arange(2,21,2)
evens_array

In [None]:
# create pandas series from a numpy array
evens_series = pd.Series(evens_array)
evens_series

In [None]:
# make a list
lst = ['data','science','math']

In [None]:
# create a series from a list
pd.Series(lst)

In [None]:
# create a series from a tuple
tup = (2,3,5)
tup_series = pd.Series(tup)
tup_series

In [None]:
# create a series from a dict (keys become indices)
dct = {'key1': 1, 'key2':2,'key3':3}
dct_series = pd.Series(dct)
dct_series

##### Accessing elements

In [None]:
# access elements by index
dct_series['key2']

In [None]:
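# or access by position (use .iloc; plain [] with an integer on a
# string-labeled series is deprecated in recent pandas versions)
dct_series.iloc[1]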

In [None]:
# or access multiple 
dct_series[['key1','key2']]

In [None]:
# access multiple by position
dct_series.iloc[:2]
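
The difference between label and position matters most when the labels are themselves integers. A minimal sketch using the explicit `.loc` (label) and `.iloc` (position) accessors; the series `int_labeled` is a made-up example, not part of the demo data:

```python
import pandas as pd

dct_series = pd.Series({'key1': 1, 'key2': 2, 'key3': 3})

# .loc always looks up by label, .iloc always by position
by_label = dct_series.loc['key2']
by_position = dct_series.iloc[1]
print(by_label, by_position)    # 2 2

# hypothetical series whose labels start at 1 instead of 0
int_labeled = pd.Series([10, 20, 30], index=[1, 2, 3])
print(int_labeled.loc[1])       # 10 (the row labeled 1)
print(int_labeled.iloc[1])      # 20 (the second row)
```

Writing `.loc` or `.iloc` makes the intent clear to a reader, even when plain square brackets would give the same result.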

##### Series attributes

In [None]:
# data type of the elements
evens_series.dtype

In [None]:
# shape (always 1d)
evens_series.shape

In [None]:
# indices in the series
evens_series.index

In [None]:
# values in the series
evens_series.values

In [None]:
# general info about the series
evens_series.info()

In [None]:
# updating indices
evens_series.index = range(1,11)
evens_series

In [None]:
# or specify index when making the series 
pd.Series(evens_array, index = range(1,11))

### Pandas dataframes

In [None]:
my_dict = {'fruit':['apple','banana','orange'],
          'color':['red','yellow','orange'],
          'yum_score':[5,5,5],
          'in fridge':[True, False, True],
          'number':[3,4,0]}

In [None]:
# dataframe from a dictionary (treats each key as a column)
fruit_df = pd.DataFrame(my_dict, index = np.arange(1,4))
fruit_df

In [None]:
# list of tuples
list_of_tups = [(i,i**2,i**3) for i in range(10)]
list_of_tups

In [None]:
# dataframe from a list of tups (treats each tuple as a row)
pd.DataFrame(list_of_tups,columns = ['n','squared','cubed'])

In [None]:
# list of dictionaries
list_of_dicts = [
    {'Median Home Price': 454000, 'Town': 'Arcata'},
     {'Median Home Price': 383000, 'Town': 'Eureka'},
]
list_of_dicts

In [None]:
# dataframe from a list of dictionaries (treats each dict as a row; keys become column names)
pd.DataFrame(list_of_dicts)
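
To connect the row-wise and column-wise constructors, here is a short sketch that builds the same table both ways and checks that the results match (the names `by_rows` and `by_cols` are just for illustration):

```python
import pandas as pd

# row-wise: one dict per row
by_rows = pd.DataFrame([
    {'Median Home Price': 454000, 'Town': 'Arcata'},
    {'Median Home Price': 383000, 'Town': 'Eureka'},
])

# column-wise: one key per column, with a list of that column's values
by_cols = pd.DataFrame({
    'Median Home Price': [454000, 383000],
    'Town': ['Arcata', 'Eureka'],
})

# equals compares shape, labels, and values
print(by_rows.equals(by_cols))
```

Which form to use mostly depends on how the data arrive: records collected one at a time fit the list of dicts, while whole columns fit the dict of lists.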

##### Accessing elements

In [None]:
# accessing columns
fruit_df.color

In [None]:
# accessing columns (another way)
fruit_df['color']

In [None]:
# accessing rows (by label)
fruit_df.loc[2]

In [None]:
# accessing rows (by position)
fruit_df.iloc[2]

In [None]:
# accessing elements (by label)
fruit_df.loc[2,'color']

In [None]:
# accessing elements (by position)
fruit_df.iloc[1,1]

In [None]:
# slicing (by label)
fruit_df.loc[1:2,['fruit','color']]

In [None]:
# slicing (by position)
fruit_df.iloc[0:2,0:2]
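
One detail in the two slicing cells above is easy to miss: a `.loc` slice includes its end label, while an `.iloc` slice stops before its end position, like ordinary Python slicing. A small sketch with a trimmed-down copy of `fruit_df` (only two columns, to keep it short):

```python
import numpy as np
import pandas as pd

fruit_df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'],
                         'color': ['red', 'yellow', 'orange']},
                        index=np.arange(1, 4))

# label slice 1:2 keeps the rows labeled 1 AND 2
by_label = fruit_df.loc[1:2, ['fruit', 'color']]

# position slice 0:2 keeps positions 0 and 1 (position 2 is excluded)
by_position = fruit_df.iloc[0:2, 0:2]

# both selections pick out the same two rows and two columns
print(by_label.equals(by_position))
```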

##### Dataframe attributes

In [None]:
# data type of elements in each column
fruit_df.dtypes

In [None]:
# shape (2d)
fruit_df.shape

In [None]:
# indices of rows
fruit_df.index

In [None]:
# all the values (output as numpy array)
fruit_df.values

In [None]:
# column names
fruit_df.columns

In [None]:
# general info
fruit_df.info()

## Why Pandas?
NumPy is nice for handling homogeneous data types, but sometimes we need more flexibility as data become more complicated: a NumPy array has a single data type, so mixing numbers and text turns everything into strings. We might also want a visually pleasing way to view the data.  
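
As a quick illustration of the single-data-type point, the sketch below puts two mixed rows into a NumPy array and into a dataframe and compares the resulting types. The names `mixed` and `mixed_df` are made up for this example:

```python
import numpy as np
import pandas as pd

# numbers and text in one NumPy array: everything is stored as a string
mixed = np.array([[101, 'John', 60000],
                  [102, 'Jane', 65000]])
print(mixed.dtype)       # a Unicode string dtype
print(mixed[0, 2])       # the salary is now the text '60000'

# the same values in a dataframe: each column keeps its own type
mixed_df = pd.DataFrame({'ID': [101, 102],
                         'Name': ['John', 'Jane'],
                         'Salary': [60000, 65000]})
print(mixed_df.dtypes)
```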

In [None]:
# Sample data (made up employees)
employee_data = np.array([
    [101, 'John', 'Engineering', 60000, '2018-01-15'],
    [102, 'Jane', 'Engineering', 65000, '2017-05-12'],
    [103, 'Doe', 'HR', 55000, '2019-02-28'],
    [104, 'Alice', 'Marketing', 70000, '2016-11-20'],
    [105, 'Bob', 'HR', 60000, '2019-09-10'],
    [106, 'Eve', 'Marketing', 75000, '2017-04-05']
])
print(employee_data)

# Same data in Pandas dataframe
employee_df = pd.DataFrame(employee_data, columns=['ID', 'Name', 'Department', 'Salary', 'Hire Date'])
employee_df['Salary'] = pd.to_numeric(employee_df['Salary'])
employee_df

In [None]:
# Get the average salary by department

# Find unique departments
unique_departments = np.unique(employee_data[:, 2])

# Calculate average salary for each department
avg_salaries = []
for department in unique_departments:
    department_salaries = employee_data[employee_data[:, 2] == department, 3].astype(float)
    avg_salaries.append(np.mean(department_salaries))

print(unique_departments)
print(avg_salaries)

In [None]:
# Do the same task with Pandas
avg_salaries = employee_df.groupby('Department')['Salary'].mean()
avg_salaries
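
The same `groupby` pattern extends to several statistics at once. A minimal sketch, assuming only the department and salary columns from the made-up employee data above, with `agg` given a list of statistic names:

```python
import pandas as pd

employee_df = pd.DataFrame({
    'Department': ['Engineering', 'Engineering', 'HR', 'Marketing', 'HR', 'Marketing'],
    'Salary': [60000, 65000, 55000, 70000, 60000, 75000],
})

# one row per department, one column per requested statistic
salary_summary = employee_df.groupby('Department')['Salary'].agg(['mean', 'min', 'max', 'count'])
print(salary_summary)
```

Doing this with plain NumPy would mean extending the loop above with one list per statistic, which is part of why Pandas is convenient for grouped summaries.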

## Activity 

Consider the following jokes:

1. Q: Why don't scientists trust atoms?
    1. Because they make up everything.
2. Q: What do you call fake spaghetti?
    1. An impasta!
3. Q: Why did the scarecrow win an award?
    1. Because he was outstanding in his field.


Create a Pandas dataframe with the jokes in one column, their answers in another column, and your rating of the joke on a scale of 0-5 stars (ints) in another column. 

Compute your average rating of these jokes.

Access the question and answer of your highest rated joke. (The output should be a Pandas dataframe with one or more rows and two columns.)