# Pandas Series and Data Frames

By the end of this lesson, students will be able to:

- Explain the relation between pandas.Series and pandas.DataFrame
- Construct simple pandas.Series and pandas.DataFrame from scratch using different initialization methods
- Perform simple operations on pandas.Series
- Navigate the pandas documentation to look for attributes and methods of pandas.Series and pandas.DataFrame


In [1]:
# Import packages
import pandas as pd
import numpy as np

The first core object of pandas is the series. A series is a one-dimensional array of indexed data.

The main difference between a pandas.Series and a NumPy array is that a pandas.Series has an index. Let’s see the difference:

In [42]:
# A numpy array
arr = np.random.randn(4) # Random values from std normal distribution

# Print class of arr
print(type(arr))

# Print the values in the numpy array
print(arr, "\n") 

<class 'numpy.ndarray'>
[ 0.98957902  1.84550021 -0.76688108 -0.31963727] 



In [43]:
# A pandas series made from the previous array
s = pd.Series(arr)
print(type(s))
print(s) 

# Notice the index is printed with the pandas series

<class 'pandas.core.series.Series'>
0    0.989579
1    1.845500
2   -0.766881
3   -0.319637
dtype: float64
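
To make the role of the index more concrete, here is a minimal sketch comparing access by position with access by label. The values and the labels `'a'` through `'d'` are made up for illustration and are not part of the lesson data.

```python
import numpy as np
import pandas as pd

# Hypothetical values and labels, chosen only for illustration
arr = np.array([0.5, 1.5, -0.7, -0.3])
s_labeled = pd.Series(arr, index=['a', 'b', 'c', 'd'])

# A NumPy array is accessed by integer position
print(arr[2])

# A series can also be accessed by its index label
print(s_labeled['c'])

# The index and the underlying values are available separately
print(s_labeled.index)
print(s_labeled.to_numpy())
```

Both lookups return the same value; the series simply adds a label on top of each position.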


# The basic method to create a pandas Series
`s = pd.Series(data, index = index)`

The data parameter can be:
- a list or NumPy array,
- a Python dictionary, or
- a single number, boolean (True/False), or string.

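The index parameter is optional. When data is a list or NumPy array, index must be a list of labels with the same length as data. If index is omitted, pandas uses the default integer index 0, 1, 2, and so on.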
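
As a rough illustration of how the index interacts with different kinds of data, the sketch below uses made-up values. It assumes a reasonably recent pandas version, where a length mismatch with list data raises a `ValueError`.

```python
import pandas as pd

# With list data, the index must have the same length as the data
try:
    pd.Series([1, 2, 3], index=['a', 'b'])
    length_error = None
except ValueError as err:
    length_error = err
    print("ValueError:", err)

# With dictionary data, the index selects keys from the dictionary;
# labels that are not keys in the dictionary get NaN
from_dict = pd.Series({'a': 1, 'b': 2}, index=['b', 'c'])
print(from_dict)
```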


## Example: Creating a `pandas.Series` from a NumPy array

In [12]:
# A series from a numpy array 
pd.Series(np.arange(3), index=[2023, 2024, 2025])

2023    0
2024    1
2025    2
dtype: int64

## Example: Creating a `pandas.Series` from a list

In [14]:
# A series from a list of strings with default index
pd.Series(['EDS 220', 'EDS 222', 'EDS 223', 'EDS 242'])

0    EDS 220
1    EDS 222
2    EDS 223
3    EDS 242
dtype: object

## Example: Creating a `pandas.Series` from a dictionary

In [16]:
# Construct dictionary
d = {'key_0':2,
    'key_1':'3',
    'key_2':5}
# Note: the values mix numbers and a string ('3'),
# so the resulting series has data type 'object'

# Initialize series using a dictionary
pd.Series(d)

key_0    2
key_1    3
key_2    5
dtype: object
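
If the string in a mixed series like this one actually represents a number, one possible way to get a numeric series is `pd.to_numeric`. This is a sketch of one option, not the only approach; the variable names `mixed` and `numeric` are introduced here for illustration.

```python
import pandas as pd

# Same dictionary as above: two integers and one string
mixed = pd.Series({'key_0': 2, 'key_1': '3', 'key_2': 5})
print(mixed.dtype)

# Inspect the Python type of each element
print(mixed.map(type))

# Convert every element to a number; '3' is parsed as 3
numeric = pd.to_numeric(mixed)
print(numeric)
```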

## Example: Creating a `pandas.Series` from a single value

If you provide only a single number, boolean, or string, pass an index as well: the value is repeated once for each index label. Without an index, the result is a series with a single element.

In [17]:
pd.Series(3.0, index = ['A', 'B', 'C'])

A    3.0
B    3.0
C    3.0
dtype: float64
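
The same pattern applies to booleans and strings. The short sketch below, with made-up values, also shows what happens when the index is left out.

```python
import pandas as pd

# With an index, the single value is repeated once per label
repeated = pd.Series(True, index=['A', 'B', 'C'])
print(repeated)

# Without an index, the result has one element and the default index 0
single = pd.Series('EDS 220')
print(single)
```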

## Simple operations

In [41]:
# Define a series
s = pd.Series([98, 73, 65], index = ['Andrea', 'Beth', 'Carolina'])

# Divide each element in series by 10
print(s / 10, '\n')

# Take the exponential of each element in series
print(np.exp(s), '\n')

# Original series is unchanged
print(s)

Andrea      9.8
Beth        7.3
Carolina    6.5
dtype: float64 

Andrea      3.637971e+42
Beth        5.052394e+31
Carolina    1.694889e+28
dtype: float64 

Andrea      98
Beth        73
Carolina    65
dtype: int64
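
Operations between two series also use the index: values are matched by label, not by position. The sketch below illustrates this with two hypothetical series (`midterm` and `final`, with invented scores) whose labels only partly overlap.

```python
import pandas as pd

# Hypothetical scores; the labels are in a different order and only partly overlap
midterm = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])
final = pd.Series([80, 90, 70], index=['Carolina', 'Andrea', 'Dana'])

# Values are added label by label; a label missing from either series gives NaN
total = midterm + final
print(total)
```

Here only Andrea and Carolina appear in both series, so only their totals are numbers.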


In [21]:
# We can also apply boolean logic to the series
s > 70

Andrea       True
Beth         True
Carolina    False
dtype: bool
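
A boolean series like this one is often used to select or count elements. Here is a minimal sketch using the same scores; the names `scores` and `above_70` are introduced for illustration.

```python
import pandas as pd

scores = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# Boolean series: True where the score is greater than 70
above_70 = scores > 70

# Keep only the elements where the condition is True
print(scores[above_70])

# True counts as 1, so the sum is the number of scores above 70
print(above_70.sum())
```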

## Identifying missing values

In [23]:
# Series with NAs in it
# np.nan is a float value which stands for "not a number"
s = pd.Series([1, 2, np.nan, 4, np.nan])
s

# data type is still a float64

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64

In [24]:
# Check if series has NAs
s.hasnans

True

In [25]:
# Which elements in the series are NAs
s.isna()

0    False
1    False
2     True
3    False
4     True
dtype: bool
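
Once missing values are identified, a few common next steps are counting them, dropping them, or filling them in. The sketch below shows one possible version of each; filling with 0 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, np.nan])

# Count the missing values (True counts as 1)
n_missing = s.isna().sum()
print(n_missing)

# Return a new series without the missing values
dropped = s.dropna()
print(dropped)

# Return a new series with the missing values replaced by 0
filled = s.fillna(0)
print(filled)
```

Both `dropna()` and `fillna()` return new series and leave `s` unchanged.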

## Check in
-999 is often used to represent missing values. Create a pandas series with four integer values, two of which are -999. The index is A through D.

Look for the method `mask()`. Use this method to update the series so that -999 values are replaced with NAs.

In [34]:
# Create a pandas series with -999 values
s = pd.Series([10, -999, 20, -999], index=['A', 'B', 'C', 'D'])
print(s)

# Use mask() to replace -999 with NaN; mask() returns a new series,
# so assign the result back to s to update it
s = s.mask(s == -999)
s

A     10
B   -999
C     20
D   -999
dtype: int64


A    10.0
B     NaN
C    20.0
D     NaN
dtype: float64
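
`mask()` is not the only way to do this. As a sketch of two alternatives, `where()` keeps values where a condition is True, and `replace()` swaps one value for another. The variable names below are introduced for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, -999, 20, -999], index=['A', 'B', 'C', 'D'])

# mask(): replace values where the condition is True
masked = s.mask(s == -999)

# where(): keep values where the condition is True, replace the rest
kept = s.where(s != -999)

# replace(): swap a specific value for another
replaced = s.replace(-999, np.nan)

print(masked)
print(kept)
print(replaced)
```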

# Creating a `pandas.DataFrame`

Each column of a `pandas.DataFrame` is a `pandas.Series`.

In [36]:
# Initialize dictionary with columns' data
d = {'col_name_1': pd.Series(np.arange(3)),
    'col_name_2': pd.Series([3.1, 3.2, 3.3]),
    }

# Create a data frame
df = pd.DataFrame(d)
df

   col_name_1  col_name_2
0           0         3.1
1           1         3.2
2           2         3.3
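
To check that a column really is a series, you can select it and look at its type. This sketch rebuilds the same data frame so that it runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name_1': pd.Series(np.arange(3)),
                   'col_name_2': pd.Series([3.1, 3.2, 3.3])})

# Selecting a single column returns a pandas.Series
col = df['col_name_2']
print(type(col))

# The column shares its index with the data frame
print(col.index.equals(df.index))

# Each column has its own data type
print(df.dtypes)
```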


In [37]:
# Change index
df.index = ['a', 'b', 'c']
df

   col_name_1  col_name_2
a           0         3.1
b           1         3.2
c           2         3.3
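
A dictionary of series is only one way to initialize a data frame. As a sketch of another option, the columns can be plain lists and the index can be set when the data frame is created. The name `df_lists` is introduced here for illustration.

```python
import pandas as pd

# Columns given as lists; index set at construction time
df_lists = pd.DataFrame({'col_name_1': [0, 1, 2],
                         'col_name_2': [3.1, 3.2, 3.3]},
                        index=['a', 'b', 'c'])
print(df_lists)
```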


## Check in

Update the column names to C1 and C2 by updating the columns attribute.

In [40]:
df.columns = ['C1', 'C2']
df

   C1   C2
a   0  3.1
b   1  3.2
c   2  3.3
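
Assigning to the columns attribute changes the data frame in place. A possible alternative is the `rename()` method, which returns a new data frame and leaves the original as it was. This sketch rebuilds the data frame so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'col_name_1': [0, 1, 2],
                   'col_name_2': [3.1, 3.2, 3.3]},
                  index=['a', 'b', 'c'])

# Map old column names to new ones; returns a new data frame
renamed = df.rename(columns={'col_name_1': 'C1', 'col_name_2': 'C2'})
print(renamed)

# The original data frame keeps its column names
print(df.columns)
```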


# Summary

Most important takeaways:

- A pandas.Series is a one-dimensional array with an index, which lets you access values by label; this is the main difference from a NumPy array
- You can make a pandas.Series from a list, a NumPy array, a dictionary, or a single value
- Operations on a pandas.Series are applied to every element and return a new Series with the resulting values; the original Series is unchanged
- Each column of a pandas.DataFrame is a pandas.Series