# Intro to Pandas
If you want to type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp25&branch=main&urlpath=tree%2Fdata271_sp25%2Flectures%2Fdata271_lec12_live.ipynb) instead. 
If you don't want to type and want to follow along just by executing the cells, stay in this notebook.

In [80]:
# Standard imports
import numpy as np
import pandas as pd

### Pandas Series

In [81]:
# Create a series from a dict (keys become indices)
dct = {'one': 1, 'two': 2,'three':3}
dct_series = pd.Series(dct)
dct_series

one      1
two      2
three    3
dtype: int64

### Accessing Elements

In [82]:
dct_series

one      1
two      2
three    3
dtype: int64

In [83]:
# Access elements by their index (bracket notation)
dct_series['one']

np.int64(1)

In [84]:
# Access elements by their index (dot/attribute notation)
dct_series.one

np.int64(1)

In [85]:
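# Accessing elements by their index position...error :(
# (recent pandas treats the integer as a label; older versions fell back to position with a warning)
dct_series[0]

# Question: what should we use instead?

KeyError: 0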

In [7]:
# Slicing by index (inclusive stop)
dct_series['one':'three']

one      1
two      2
three    3
dtype: int64

In [8]:
# Slicing by position (exclusive stop)
dct_series[0:2]

one    1
two    2
dtype: int64

In [9]:
# These different ways of accessing elements can get confusing if indices are ints
new_series = pd.Series({1:1,2:2,3:3,4:4,5:5})
new_series

1    1
2    2
3    3
4    4
5    5
dtype: int64

In [10]:
# Are we accessing by index or position here?
new_series[1]

np.int64(1)

In [11]:
# Are we slicing by index or position here?
new_series[1:3]

2    2
3    3
dtype: int64

Be explicit: use `.loc` for index-based (label) access and `.iloc` for position-based access.

In [12]:
# Access by index
new_series.loc[1]

np.int64(1)

In [13]:
# Access by position
new_series.iloc[1]

np.int64(2)

In [14]:
# Slice by index
new_series.loc[1:3]

1    1
2    2
3    3
dtype: int64

In [15]:
# Slice by position
new_series.iloc[1:3]

2    2
3    3
dtype: int64
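
The same rule answers the earlier question about `dct_series[0]`. A minimal sketch, rebuilding the string-labeled series from the top of the notebook, of how `.loc` and `.iloc` behave when the labels are not integers (the variable names below are only for illustration):

```python
import pandas as pd

# Same string-labeled series as at the top of the notebook
dct_series = pd.Series({'one': 1, 'two': 2, 'three': 3})

# Single elements: position 0 and label 'one' point at the same value
first_by_position = dct_series.iloc[0]
first_by_label = dct_series.loc['one']

# Slices: .loc includes the stop label, .iloc excludes the stop position
label_slice = dct_series.loc['one':'two']
position_slice = dct_series.iloc[0:2]

print(first_by_position, first_by_label)
print(label_slice.tolist(), position_slice.tolist())
```

Both slices return the first two elements here, but for different reasons: the label slice stops *at* `'two'`, while the position slice stops *before* position 2.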

### Advanced indexing

In [16]:
# Select multiple elements by index
new_series.loc[[1,3]]

1    1
3    3
dtype: int64

In [17]:
# Select multiple elements by position
new_series.iloc[[1,3]]

2    2
4    4
dtype: int64

In [18]:
# Supports Boolean indexing
new_series[new_series % 2 == 0]

2    2
4    4
dtype: int64

In [19]:
# Conditional indexing
new_series[(new_series > 2) & (new_series < 5)]

3    3
4    4
dtype: int64

In [20]:
# Another way to do conditional indexing (specify inclusive as 'both','neither','left', or 'right')
new_series[new_series.between(2,5,inclusive='neither')]

3    3
4    4
dtype: int64
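
To see what each `inclusive` option does, here is a small sketch that applies all four to the same series and bounds used above. The `kept` dictionary is just a helper name for this illustration.

```python
import pandas as pd

# Same integer-indexed series as in the cells above
new_series = pd.Series({1: 1, 2: 2, 3: 3, 4: 4, 5: 5})

# Collect the values kept by each setting of `inclusive`
kept = {}
for option in ['both', 'neither', 'left', 'right']:
    mask = new_series.between(2, 5, inclusive=option)
    kept[option] = new_series[mask].tolist()
    print(option, kept[option])
```

`'left'` keeps the lower bound (2) but drops the upper bound (5), and `'right'` does the opposite.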

In [21]:
# Select specific elements
new_series[(new_series == 1) | (new_series == 4)]

1    1
4    4
dtype: int64

In [22]:
# Another way to select specific elements
new_series[new_series.isin([1,4])]

1    1
4    4
dtype: int64

### General Series Info

In [24]:
# general info about the series
dct_series.info()

<class 'pandas.Series'>
Index: 3 entries, one to three
Series name: None
Non-Null Count  Dtype
--------------  -----
3 non-null      int64
dtypes: int64(1)
memory usage: 156.0+ bytes


### Creating Pandas DataFrames

In [25]:
my_dict = {'fruit':['apple','banana','orange'],
          'color':['red','yellow','orange'],
          'yum_score':[5,5,5],
          'in fridge':[True, False, True]}

In [26]:
# dataframe from a dictionary (treats each key as a column)
fruit_df = pd.DataFrame(my_dict)
fruit_df

|   | fruit | color | yum_score | in fridge |
|---|-------|-------|-----------|-----------|
| 0 | apple | red | 5 | True |
| 1 | banana | yellow | 5 | False |
| 2 | orange | orange | 5 | True |


In [27]:
# dataframe from a dictionary (specify row labels)
fruit_df = pd.DataFrame(my_dict,index = np.arange(1,4))
fruit_df

|   | fruit | color | yum_score | in fridge |
|---|-------|-------|-----------|-----------|
| 1 | apple | red | 5 | True |
| 2 | banana | yellow | 5 | False |
| 3 | orange | orange | 5 | True |


In [28]:
# To change column labels
fruit_df.columns = ['fruit','color','yum_score','in_fridge']
fruit_df

|   | fruit | color | yum_score | in_fridge |
|---|-------|-------|-----------|-----------|
| 1 | apple | red | 5 | True |
| 2 | banana | yellow | 5 | False |
| 3 | orange | orange | 5 | True |
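
Assigning to `.columns` replaces every label at once. When only one label needs to change, `DataFrame.rename` is an alternative; the sketch below rebuilds the fruit data and renames just `'in fridge'` (the name `renamed` is only for this illustration).

```python
import numpy as np
import pandas as pd

my_dict = {'fruit': ['apple', 'banana', 'orange'],
           'color': ['red', 'yellow', 'orange'],
           'yum_score': [5, 5, 5],
           'in fridge': [True, False, True]}
fruit_df = pd.DataFrame(my_dict, index=np.arange(1, 4))

# rename returns a new DataFrame; only the listed column label changes
renamed = fruit_df.rename(columns={'in fridge': 'in_fridge'})

print(list(renamed.columns))
print(list(fruit_df.columns))  # the original DataFrame is unchanged
```

Because `rename` returns a copy by default, assign the result back to a variable if you want to keep the new labels.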


In [29]:
# nested lists (or lists of tuples)
lists = [[i,i**2,i**3] for i in range(10)]
lists

[[0, 0, 0],
 [1, 1, 1],
 [2, 4, 8],
 [3, 9, 27],
 [4, 16, 64],
 [5, 25, 125],
 [6, 36, 216],
 [7, 49, 343],
 [8, 64, 512],
 [9, 81, 729]]

In [30]:
# dataframe from a list of lists (treats each sublist as a row)
pd.DataFrame(lists)

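|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 |
| 2 | 2 | 4 | 8 |
| 3 | 3 | 9 | 27 |
| 4 | 4 | 16 | 64 |
| 5 | 5 | 25 | 125 |
| 6 | 6 | 36 | 216 |
| 7 | 7 | 49 | 343 |
| 8 | 8 | 64 | 512 |
| 9 | 9 | 81 | 729 |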


In [31]:
# specify column names 
pd.DataFrame(lists,columns = ['n','squared','cubed'])

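|   | n | squared | cubed |
|---|---|---------|-------|
| 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 |
| 2 | 2 | 4 | 8 |
| 3 | 3 | 9 | 27 |
| 4 | 4 | 16 | 64 |
| 5 | 5 | 25 | 125 |
| 6 | 6 | 36 | 216 |
| 7 | 7 | 49 | 343 |
| 8 | 8 | 64 | 512 |
| 9 | 9 | 81 | 729 |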


In [32]:
# list of dictionaries (with same keys)
list_of_dicts = [
    {'Median Home Price': 454000, 'Town': 'Arcata'},
     {'Median Home Price': 383000, 'Town': 'Eureka'},
]
list_of_dicts

[{'Median Home Price': 454000, 'Town': 'Arcata'},
 {'Median Home Price': 383000, 'Town': 'Eureka'}]

In [33]:
# Dataframe from list of dictionaries (treats each dict as a row and each key as a column)
pd.DataFrame(list_of_dicts)

|   | Median Home Price | Town |
|---|-------------------|------|
| 0 | 454000 | Arcata |
| 1 | 383000 | Eureka |
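
The example above uses dictionaries with the same keys. As a hedged sketch of what happens otherwise, the made-up variation below drops `'Median Home Price'` from the second dict; pandas fills the gap with `NaN` (the names `records` and `towns_df` are hypothetical).

```python
import pandas as pd

# Hypothetical variation: the second dict has no 'Median Home Price' key
records = [
    {'Median Home Price': 454000, 'Town': 'Arcata'},
    {'Town': 'Eureka'},
]
towns_df = pd.DataFrame(records)

print(towns_df)
# isna() flags the cell that had no value in its source dict
print(towns_df['Median Home Price'].isna().tolist())
```

Note that the price column becomes `float64` rather than `int64`, because `NaN` is a floating-point value.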


### Accessing DataFrame Elements

#### Columns: use `df.col_name` or `df['col_name']`

In [34]:
# accessing columns (dot notation)

fruit_df.color

1       red
2    yellow
3    orange
Name: color, dtype: str

In [35]:
# accessing columns (bracket notation)

fruit_df['color']

1       red
2    yellow
3    orange
Name: color, dtype: str

#### Rows: use `df.loc[label]` or `df.iloc[position]`

In [52]:
# accessing rows (by index name / label)

fruit_df.loc[2]

fruit        banana
color        yellow
yum_score         5
in_fridge     False
Name: 2, dtype: object

In [37]:
# accessing rows (by integer position)

fruit_df.iloc[2]

fruit        orange
color        orange
yum_score         5
in_fridge      True
Name: 3, dtype: object

#### Elements: index with `[row, column]` on `.loc` or `.iloc`
- if using name/label, use `df.loc[row, column]`
- if using integer position, use `df.iloc[row, column]`

In [69]:
# accessing elements (by label)

fruit_df.loc[2,'color']

'yellow'

In [70]:
# accessing elements (by position)
fruit_df.iloc[1,1]

'yellow'

#### Slicing and subsetting by label or by position

In [54]:
# slicing (by label)
fruit_df.loc[1:2,['fruit','color']]

|   | fruit | color |
|---|-------|-------|
| 1 | apple | red |
| 2 | banana | yellow |


In [53]:
# slicing (by position)
fruit_df.iloc[0:2,0:2]

|   | fruit | color |
|---|-------|-------|
| 1 | apple | red |
| 2 | banana | yellow |


In [42]:
# Subsetting columns (by column label)
fruit_df[['fruit','yum_score']]

|   | fruit | yum_score |
|---|-------|-----------|
| 1 | apple | 5 |
| 2 | banana | 5 |
| 3 | orange | 5 |


In [43]:
# Subsetting columns (by column position)
fruit_df.iloc[:,[0,2]]

|   | fruit | yum_score |
|---|-------|-----------|
| 1 | apple | 5 |
| 2 | banana | 5 |
| 3 | orange | 5 |


### Dataframe attributes

| Attribute / Method        | What It Returns | Notes |
|---------------------------|----------------|-------|
| `df.dtypes` | Series showing each column name and its data type | Useful for checking numeric vs. non-numeric columns before analysis |
| `df.shape` | tuple `(rows, columns)` |  |
| `df.index` | Pandas object representing row labels | Convert to list with `list(df.index)` if you need a regular Python list |
| `df.columns` | Pandas object representing column names | Convert to list with `list(df.columns)` |
| `df.values` | NumPy array of the DataFrame's values | dtype is `object` when the column types are mixed |
| `df.info()` | `None` (prints a summary) | Summary shows column names, non-null counts, dtypes, and memory usage |
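
A quick sketch of the practical difference between the attributes and the one method in this table, using a rebuilt copy of the fruit data. Attributes are read without parentheses, while `info()` is called and prints its summary instead of returning it (the variable names are only for illustration).

```python
import numpy as np
import pandas as pd

fruit_df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'],
                         'color': ['red', 'yellow', 'orange'],
                         'yum_score': [5, 5, 5],
                         'in_fridge': [True, False, True]},
                        index=np.arange(1, 4))

# Attributes: no parentheses
n_rows, n_cols = fruit_df.shape
row_labels = list(fruit_df.index)
col_labels = list(fruit_df.columns)

# Method: info() prints the summary and returns None
result = fruit_df.info()

print(n_rows, n_cols, row_labels, col_labels, result)
```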



In [44]:
# data type of elements in each column
fruit_df.dtypes

fruit          str
color          str
yum_score    int64
in_fridge     bool
dtype: object

In [45]:
# shape (2d)
fruit_df.shape

(3, 4)

In [46]:
# row labels
fruit_df.index

Index([1, 2, 3], dtype='int64')

In [47]:
# column labels
fruit_df.columns

Index(['fruit', 'color', 'yum_score', 'in_fridge'], dtype='str')

In [48]:
# all the values output as numpy array (note dtype)
fruit_df.values

array([['apple', 'red', 5, True],
       ['banana', 'yellow', 5, False],
       ['orange', 'orange', 5, True]], dtype=object)

In [94]:
# general info
fruit_df.info()

<class 'pandas.DataFrame'>
Index: 3 entries, 1 to 3
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   fruit      3 non-null      str  
 1   color      3 non-null      str  
 2   yum_score  3 non-null      int64
 3   in_fridge  3 non-null      bool 
dtypes: bool(1), int64(1), str(2)
memory usage: 207.0 bytes


## Why Pandas?
NumPy is nice for handling homogeneous data types, but we need more flexibility as data become more complicated: a NumPy array has a single dtype, while each DataFrame column can have its own. We might also want a visually pleasing way to view the data.

In [64]:
# Sample data (made up employees)
employee_data = np.array([
    [101, 'John', 'Engineering', 60000, '2018-01-15'],
    [102, 'Jane', 'Engineering', 65000, '2017-05-12'],
    [103, 'Doe', 'HR', 55000, '2019-02-28'],
    [104, 'Alice', 'Marketing', 70000, '2016-11-20'],
    [105, 'Bob', 'HR', 60000, '2019-09-10'],
    [106, 'Eve', 'Marketing', 75000, '2017-04-05']
])

print(f'''What is the data type??
- int64
- float64
- <U21 .....aka string
dtype is: {employee_data.dtype}
''')
employee_data

What is the data type??
- int64
- float64
- <U21 .....aka string
dtype is: <U21



array([['101', 'John', 'Engineering', '60000', '2018-01-15'],
       ['102', 'Jane', 'Engineering', '65000', '2017-05-12'],
       ['103', 'Doe', 'HR', '55000', '2019-02-28'],
       ['104', 'Alice', 'Marketing', '70000', '2016-11-20'],
       ['105', 'Bob', 'HR', '60000', '2019-09-10'],
       ['106', 'Eve', 'Marketing', '75000', '2017-04-05']], dtype='<U21')

In [65]:
# Same data in Pandas dataframe

employee_df = pd.DataFrame(employee_data, columns=['ID', 'Name', 'Department', 'Salary', 'Hire Date'])
employee_df['Salary'] = pd.to_numeric(employee_df['Salary'])

employee_df

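|   | ID | Name | Department | Salary | Hire Date |
|---|----|------|------------|--------|-----------|
| 0 | 101 | John | Engineering | 60000 | 2018-01-15 |
| 1 | 102 | Jane | Engineering | 65000 | 2017-05-12 |
| 2 | 103 | Doe | HR | 55000 | 2019-02-28 |
| 3 | 104 | Alice | Marketing | 70000 | 2016-11-20 |
| 4 | 105 | Bob | HR | 60000 | 2019-09-10 |
| 5 | 106 | Eve | Marketing | 75000 | 2017-04-05 |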


In [66]:
# Get the average salary by department

# Find unique departments
unique_departments = np.unique(employee_data[:, 2])

# Calculate average salary for each department
avg_salaries = []
for department in unique_departments:
    # grouping by department
    department_salaries = employee_data[employee_data[:, 2] == department, 3].astype(float)
    avg_salaries.append(np.mean(department_salaries))

print(unique_departments)
print(avg_salaries)

['Engineering' 'HR' 'Marketing']
[np.float64(62500.0), np.float64(57500.0), np.float64(72500.0)]


In [None]:
# Do the same task with Pandas
avg_salaries = employee_df.groupby('Department')['Salary'].mean()
avg_salaries
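
To make the one-line `groupby` easier to check against the NumPy loop, here is a self-contained sketch. It assumes the same made-up employees, but builds the DataFrame directly from a dict so that `Salary` is numeric from the start.

```python
import pandas as pd

# Same made-up employees as above, built directly as a DataFrame
employee_df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Doe', 'Alice', 'Bob', 'Eve'],
    'Department': ['Engineering', 'Engineering', 'HR',
                   'Marketing', 'HR', 'Marketing'],
    'Salary': [60000, 65000, 55000, 70000, 60000, 75000],
})

# Split rows by department, then average Salary within each group
avg_salaries = employee_df.groupby('Department')['Salary'].mean()
print(avg_salaries)

# The result is a Series indexed by department, so label lookups work
print(avg_salaries.loc['HR'])
```

The three means match the values from the NumPy loop, and the department names travel with the numbers as the index instead of living in a separate array.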

## Activity 

Consider the following data:

|Pet name | Species| Age| Adoption Fee|
|---------|--------|----|-------------|
|Whiskers | Cat    | 3  | 25.00       |
|Bubbles  | Fish   | 1  | 3.00        |
| Rover   | Dog    | 2  | 75.00       |
| Hopper  | Bunny  | 2  | 15.00       |


### Create a Pandas dataframe containing the data above. 

#### Answer

In [90]:
pet_dict = {'Pet name': ['Whiskers', 'Bubbles', 'Rover', 'Hopper', 'Numpy', 'Poots'],
           'Species': ['cat', 'fish', 'dog', 'bun', 'cat', 'cat'],
           'Age': [3, 1, 2, 2, 5, 4],
           'Adoption fee': [25.00, 3.00, 75.00, 15.00, 125.00, 100.00]}

pet_df = pd.DataFrame(pet_dict)
pet_df

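|   | Pet name | Species | Age | Adoption fee |
|---|----------|---------|-----|--------------|
| 0 | Whiskers | cat | 3 | 25.0 |
| 1 | Bubbles | fish | 1 | 3.0 |
| 2 | Rover | dog | 2 | 75.0 |
| 3 | Hopper | bun | 2 | 15.0 |
| 4 | Numpy | cat | 5 | 125.0 |
| 5 | Poots | cat | 4 | 100.0 |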


### Select and display only the `Pet name` and `Species` columns.

#### Answer

In [91]:
pet_df[['Pet name', 'Species']]

|   | Pet name | Species |
|---|----------|---------|
| 0 | Whiskers | cat |
| 1 | Bubbles | fish |
| 2 | Rover | dog |
| 3 | Hopper | bun |
| 4 | Numpy | cat |
| 5 | Poots | cat |


### Select and display only the first two rows of the dataframe.

#### Answer

In [92]:
pet_df.iloc[0:2,:]

|   | Pet name | Species | Age | Adoption fee |
|---|----------|---------|-----|--------------|
| 0 | Whiskers | cat | 3 | 25.0 |
| 1 | Bubbles | fish | 1 | 3.0 |


### Select and display the pets that have an adoption fee less than $20. 

#### Answer

In [93]:
pet_df[pet_df['Adoption fee']<20]

|   | Pet name | Species | Age | Adoption fee |
|---|----------|---------|-----|--------------|
| 1 | Bubbles | fish | 1 | 3.0 |
| 3 | Hopper | bun | 2 | 15.0 |
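
The conditional indexing shown earlier for a Series carries over to DataFrames. As a sketch, the filter below combines two conditions with `&`; the $110 threshold and the variable names are made up for illustration.

```python
import pandas as pd

pet_df = pd.DataFrame({
    'Pet name': ['Whiskers', 'Bubbles', 'Rover', 'Hopper', 'Numpy', 'Poots'],
    'Species': ['cat', 'fish', 'dog', 'bun', 'cat', 'cat'],
    'Age': [3, 1, 2, 2, 5, 4],
    'Adoption fee': [25.00, 3.00, 75.00, 15.00, 125.00, 100.00],
})

# Each comparison produces a Boolean Series with one entry per row
is_cat = pet_df['Species'] == 'cat'
under_110 = pet_df['Adoption fee'] < 110

# & keeps only the rows where both conditions are True
affordable_cats = pet_df[is_cat & under_110]
print(affordable_cats)
```

If you write the conditions inline, wrap each one in parentheses, as in the Series example, because `&` binds more tightly than `==` and `<`.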


### Sort based on age

#### Answer

In [96]:
pet_df.sort_values(by='Age')

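|   | Pet name | Species | Age | Adoption fee |
|---|----------|---------|-----|--------------|
| 1 | Bubbles | fish | 1 | 3.0 |
| 2 | Rover | dog | 2 | 75.0 |
| 3 | Hopper | bun | 2 | 15.0 |
| 0 | Whiskers | cat | 3 | 25.0 |
| 5 | Poots | cat | 4 | 100.0 |
| 4 | Numpy | cat | 5 | 125.0 |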
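
`sort_values` sorts in ascending order by default and returns a new DataFrame. Two variations you might want, sketched on the same pet data: reversing the order, and breaking ties in `Age` with a second column (the variable names are only for illustration).

```python
import pandas as pd

pet_df = pd.DataFrame({
    'Pet name': ['Whiskers', 'Bubbles', 'Rover', 'Hopper', 'Numpy', 'Poots'],
    'Species': ['cat', 'fish', 'dog', 'bun', 'cat', 'cat'],
    'Age': [3, 1, 2, 2, 5, 4],
    'Adoption fee': [25.00, 3.00, 75.00, 15.00, 125.00, 100.00],
})

# Oldest first
oldest_first = pet_df.sort_values(by='Age', ascending=False)

# Youngest first; pets of the same age are ordered by the lower adoption fee
by_age_then_fee = pet_df.sort_values(by=['Age', 'Adoption fee'])

print(oldest_first['Pet name'].tolist())
print(by_age_then_fee['Pet name'].tolist())
```

With the second sort key, Hopper (fee 15) is listed before Rover (fee 75) even though both are 2 years old.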


### Group by species and find the mean adoption fee

#### Answer

In [97]:
pet_df.groupby('Species')['Adoption fee'].mean()

Species
bun     15.000000
cat     83.333333
dog     75.000000
fish     3.000000
Name: Adoption fee, dtype: float64

In [99]:
# Count how many pets there are of each species (most common first)
pet_df.value_counts('Species')

Species
cat     3
fish    1
dog     1
bun     1
Name: count, dtype: int64
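
Counting rows per species is itself a group-by operation. The sketch below compares counting on the `Species` column directly with `groupby(...).size()`; the two give the same numbers and differ only in ordering (the variable names are only for illustration).

```python
import pandas as pd

pet_df = pd.DataFrame({
    'Pet name': ['Whiskers', 'Bubbles', 'Rover', 'Hopper', 'Numpy', 'Poots'],
    'Species': ['cat', 'fish', 'dog', 'bun', 'cat', 'cat'],
    'Age': [3, 1, 2, 2, 5, 4],
    'Adoption fee': [25.00, 3.00, 75.00, 15.00, 125.00, 100.00],
})

# value_counts on the column: ordered by count, largest first
counts = pet_df['Species'].value_counts()

# groupby + size: ordered by the group label instead
sizes = pet_df.groupby('Species').size()

print(counts.to_dict())
print(sizes.to_dict())
```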