# Pandas Series and Data Frames

Date: 09/30/2024

[link to exercises](https://meds-eds-220.github.io/MEDS-eds-220-course/book/chapters/lesson-2-series-dataframes.html)

In [3]:
# Import packages
import pandas as pd
import numpy as np

## Series
- one-dimensional array of indexed data
- The index is what differentiates a `pandas.Series` from a NumPy array
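
As a minimal sketch of what the index adds, the example below builds a Series from a NumPy array and gives it string labels. The names `values` and `labeled` and the labels themselves are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: three values with string labels
values = np.array([10.0, 20.0, 30.0])
labeled = pd.Series(values, index=["a", "b", "c"])

# A NumPy array is addressed by integer position only
second_from_array = values[1]

# A Series can also be addressed by its index label
second_from_series = labeled["b"]

print(second_from_array, second_from_series)
print(labeled.index)
```

Both lookups return the same value; the Series just offers a label-based way to reach it.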

In [2]:
# Numpy array
array = np.random.rand(4)
print(type(array), "\n", array)

# Pandas Series using array
series = pd.Series(array)
print(type(series), series)

<class 'numpy.ndarray'> 
 [0.89317834 0.60565902 0.92206991 0.91785356]
<class 'pandas.core.series.Series'> 0    0.893178
1    0.605659
2    0.922070
3    0.917854
dtype: float64


### Create a `pandas.Series`

In [3]:
# General format: s = pd.Series(data, index=index)

# Make a pandas.Series from NumPy array
pd.Series(np.arange(3), index=[2023, 2024, 2025])

2023    0
2024    1
2025    2
dtype: int64

In [4]:
# Make a pandas.Series from list
pd.Series(['EDS 220', 'EDS 222', 'EDS 223', 'EDS 242'])

0    EDS 220
1    EDS 222
2    EDS 223
3    EDS 242
dtype: object

In [5]:
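# Make a pandas.Series from dictionary

## Make dictionary (named `data` so the built-in `dict` is not shadowed)
## 'key_1' holds the string '3', so the resulting Series has dtype object

data = {'key_0':2, 'key_1':'3', 'key_2':5}

## Make series

pd.Series(data)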

key_0    2
key_1    3
key_2    5
dtype: object

In [6]:
# Make a pandas.Series from 1 value
pd.Series(3.0, index = ['A', 'B', 'C'])

A    3.0
B    3.0
C    3.0
dtype: float64

## Simple Operations

- Arithmetic operations are applied element-wise to a series, and most NumPy functions accept a series as input

In [7]:
# Define Series
s = pd.Series([98,73,65],index=['Andrea', 'Beth', 'Carolina'])

# Divide each value by 10
print(s / 10, '\n')

# Take the exponential of each value
print(np.exp(s), '\n')

# Original series is unchanged
print(s)

Andrea      9.8
Beth        7.3
Carolina    6.5
dtype: float64 

Andrea      3.637971e+42
Beth        5.052394e+31
Carolina    1.694889e+28
dtype: float64 

Andrea      98
Beth        73
Carolina    65
dtype: int64


In [8]:
# Boolean
s > 70

Andrea       True
Beth         True
Carolina    False
dtype: bool

### Identifying Missing Values
- pandas represents a missing, NULL, or NA value with the float value `numpy.nan` (not a number)
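
Because `numpy.nan` is a float, it has two practical consequences worth seeing in code. The sketch below uses a made-up series (`with_nan`) to show that integer data containing a missing value is stored as floats, and that an equality test cannot find NaN, which is why pandas provides `isna()`.

```python
import numpy as np
import pandas as pd

# Integers plus one missing value: pandas stores the result as float64
with_nan = pd.Series([1, 2, np.nan])
print(with_nan.dtype)

# NaN never compares equal to anything, including itself
print(np.nan == np.nan)

# So an equality test finds nothing, while isna() finds the missing entry
found_by_equality = (with_nan == np.nan).sum()
found_by_isna = with_nan.isna().sum()
print(found_by_equality, found_by_isna)
```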

In [9]:
# Make Series with NAs in it
s = pd.Series([1, 2, np.nan, 4, np.nan])
print(s)

# check for any NAs (True = has NA)
# print it: a notebook cell only displays its last expression automatically
print(s.hasnans)

# determine WHICH are NAs
s.isna()

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64
True


0    False
1    False
2     True
3    False
4     True
dtype: bool

## Check in
- The integer number -999 is often used to represent missing values. Create a pandas.Series named s with four integer values, two of which are -999. The index of this series should be the letters A through D.
- In the pandas.Series documentation, look for the method mask(). Use this method to update the series s so that the -999 values are replaced by NA values. HINT: check the first example in the method’s documentation.

In [13]:
# 1
s = pd.Series([-999,-999,4,39], index=['A', 'B', 'C', 'D'])
print(s)

A   -999
B   -999
C      4
D     39
dtype: int64


In [15]:
# 2

# `mask`: Series.mask(cond, other=<no_default>, *, inplace=False, axis=None, level=None)
# Replaces values where the condition is True (with NaN by default).
s = s.mask(s == -999)
# s.mask(s < 0) gives the same result here, but only because -999 is the only negative value
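
The cell above does not display its result, so here is a self-contained sketch of what `mask()` produces for the same values. The names `masked` and `zero_filled` are illustrative, and the use of `other=0` is only to show the optional argument, not part of the exercise.

```python
import pandas as pd

# Same values as the check-in answer
s = pd.Series([-999, -999, 4, 39], index=["A", "B", "C", "D"])

# mask() returns a new Series; entries where the condition is True become NaN
masked = s.mask(s == -999)
print(masked)

# The result is float64 because NaN is a float value
print(masked.dtype)

# The optional `other` argument supplies a different replacement value
zero_filled = s.mask(s == -999, other=0)
print(zero_filled)
```

Note that `mask()` leaves `s` itself unchanged unless the result is assigned back, as the cell above does.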

## Data Frames
- Represents tabular data, like a spreadsheet
- Each column of `pandas.DataFrame` is a `pandas.Series`
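
A quick way to check the second point is to pull one column out of a data frame and inspect its type. This is a minimal sketch with a made-up two-column data frame.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column data frame
df = pd.DataFrame({"col_name_1": np.arange(3), "col_name_2": [3.1, 3.2, 3.3]})

# Selecting one column with square brackets returns a pandas.Series
column = df["col_name_2"]
print(type(column))

# The column shares the data frame's index
print(column.index.equals(df.index))
```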

In [51]:
# Make a data frame from a dictionary
d = {'col_name_1' : pd.Series(np.arange(3)),
     'col_name_2' : pd.Series([3.1, 3.2, 3.3]),
     }

# Make the data frame
df = pd.DataFrame(d)
print(df)

# Change index
df.index = ['a','b','c']
print(df)

   col_name_1  col_name_2
0           0         3.1
1           1         3.2
2           2         3.3
   col_name_1  col_name_2
a           0         3.1
b           1         3.2
c           2         3.3


## Check in

- We can access the data frame’s column names via the columns attribute. Update the column names to C1 and C2 by updating this attribute.

In [62]:
df.columns = ['C1', 'C2']
print(df)

   C1   C2
a   0  3.1
b   1  3.2
c   2  3.3
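
Assigning to `columns` replaces every name at once and modifies the data frame in place. As a possible alternative (not what the check-in asks for), `DataFrame.rename` maps old names to new ones and returns a new data frame. The sketch below rebuilds the same data frame so it runs on its own.

```python
import pandas as pd

# Same data as above, rebuilt so the example is self-contained
df = pd.DataFrame(
    {"col_name_1": [0, 1, 2], "col_name_2": [3.1, 3.2, 3.3]},
    index=["a", "b", "c"],
)

# rename() maps old names to new ones and returns a new data frame
renamed = df.rename(columns={"col_name_1": "C1", "col_name_2": "C2"})
print(renamed.columns.tolist())

# The original keeps its names
print(df.columns.tolist())
```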
