## Pandas series and data frames

In this lesson we introduce the two core objects in the pandas library, the pandas.Series and the pandas.DataFrame. The overall goal is to gain familiarity with these two objects, understand their relation to each other, and review Python data structures such as dictionaries and lists.

By the end of this lesson, students will be able to:

- Explain the relation between pandas.Series and pandas.DataFrame
- Construct simple pandas.Series and pandas.DataFrame from scratch using different initialization methods
- Perform simple operations on pandas.Series
- Navigate the pandas documentation to look for attributes and methods of pandas.Series and pandas.DataFrame

In [None]:
import pandas as pd
import numpy as np

### Series

In [4]:
# A numpy array
arr = np.random.randn(4) # random values from std normal distribution
print(type(arr))
print(arr, "\n")

# A pandas series made from the previous array
s = pd.Series(arr)
print(type(s))
print(s)

<class 'numpy.ndarray'>
[ 0.16123086 -0.17893887  0.42500863  0.61875697] 

<class 'pandas.core.series.Series'>
0    0.161231
1   -0.178939
2    0.425009
3    0.618757
dtype: float64
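
The output above suggests that a series is a NumPy array paired with an index. A minimal sketch of that relation follows; the names `values` and `labels` are illustrative and not part of the original lesson.

```python
import numpy as np
import pandas as pd

arr = np.random.randn(4)  # random values from the standard normal distribution
s = pd.Series(arr)

# The data stored in the series can be recovered as a NumPy array
values = s.to_numpy()
print(type(values))

# When no index is given, pandas assigns a default integer index starting at 0
labels = s.index
print(labels)
```

Here `to_numpy()` and `index` are, respectively, a method and an attribute of pandas.Series.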


Creating a pandas.Series

The basic way to create a pandas.Series is to call:
s = pd.Series(data, index=index)

The data parameter can be:

- a list or NumPy array,
- a Python dictionary, or
- a single number, boolean (True/False), or string.

The index parameter is optional. If we include it, it must be a list of indices; when data is a list or NumPy array, the index must have the same length as data.
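
As a small sketch of the length requirement for list-like data, the example below builds one series with a matching index and then attempts one with an index that is too short. The names `ok` and `raised` are illustrative only.

```python
import pandas as pd

# Data and index have the same length: the series is created
ok = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(ok)

# Data and index have different lengths: pandas raises a ValueError
try:
    pd.Series([10, 20, 30], index=['a', 'b'])
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```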

Example: Creating a pandas.Series from a NumPy array

In [6]:
# A series from a numpy array 
pd.Series(np.arange(3), index=[2023, 2024, 2025])

2023    0
2024    1
2025    2
dtype: int64

Example: Creating a pandas.Series from a list

In [7]:
# A series from a list of strings with default index
pd.Series(['EDS 220', 'EDS 222', 'EDS 223', 'EDS 242'])

0    EDS 220
1    EDS 222
2    EDS 223
3    EDS 242
dtype: object

Example: Creating a pandas.Series from a dictionary

In [8]:
# Construct dictionary
d = {'key_0':2, 'key_1':'3', 'key_2':5}

# Initialize series using a dictionary
pd.Series(d)

key_0    2
key_1    3
key_2    5
dtype: object
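
The series above has dtype object because its values mix integers with the string '3'. The following sketch compares it with an all-integer dictionary; `d_int`, `d_mixed`, `s_int`, and `s_mixed` are hypothetical names introduced for this illustration.

```python
import pandas as pd

# All values are integers, so pandas infers an integer dtype
d_int = {'key_0': 2, 'key_1': 3, 'key_2': 5}
s_int = pd.Series(d_int)
print(s_int.dtype)

# One value is the string '3', so the series falls back to the generic object dtype
d_mixed = {'key_0': 2, 'key_1': '3', 'key_2': 5}
s_mixed = pd.Series(d_mixed)
print(s_mixed.dtype)

# In both cases the dictionary keys become the index of the series
print(list(s_mixed.index))
```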

Example: Creating a pandas.Series from a single value

In [9]:
pd.Series(3.0, index = ['A', 'B', 'C'])

A    3.0
B    3.0
C    3.0
dtype: float64

In [10]:
# Define a series
s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# Divide each element in series by 10
print(s / 10, '\n')

# Take the exponential of each element in series
print(np.exp(s), '\n')

# Original series is unchanged
print(s)

Andrea      9.8
Beth        7.3
Carolina    6.5
dtype: float64 

Andrea      3.637971e+42
Beth        5.052394e+31
Carolina    1.694889e+28
dtype: float64 

Andrea      98
Beth        73
Carolina    65
dtype: int64


In [11]:
s > 70

Andrea       True
Beth         True
Carolina    False
dtype: bool
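
A comparison like the one above returns a boolean series, which can in turn be used to select elements. This is a minimal sketch of that idea; the name `above_70` is illustrative.

```python
import pandas as pd

s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# Boolean series: True where the value is greater than 70
above_70 = s > 70

# Use the boolean series to keep only the matching elements
print(s[above_70])

# True counts as 1, so sum() gives the number of matching elements
print(above_70.sum())
```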

Identifying missing values

In [12]:
# Series with NAs in it
s = pd.Series([1, 2, np.nan, 4, np.nan])
s

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64

In [13]:
# Check if series has NAs
s.hasnans

True

In [14]:
s.isna()

0    False
1    False
2     True
3    False
4     True
dtype: bool
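
Building on `isna()`, one possible next step is to count the missing values or drop them. The sketch below uses the same series; `n_missing` and `s_clean` are illustrative names.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, np.nan])

# isna() returns booleans, so summing them counts the missing values
n_missing = s.isna().sum()
print(n_missing)

# dropna() returns a new series without the missing values;
# the original series is unchanged
s_clean = s.dropna()
print(s_clean)
```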

1. The integer number -999 is often used to represent missing values. Create a pandas.Series named s with four integer values, two of which are -999. The index of this series should be the letters A through D.

2. In the pandas.Series documentation, look for the method mask(). Use this method to update the series s so that the -999 values are replaced by NA values. HINT: check the first example in the method’s documentation.

    Series.mask(cond, other=<no_default>, *, inplace=False, axis=None, level=None)
    https://pandas.pydata.org/docs/reference/api/pandas.Series.mask.html


In [19]:
# Series with -999 as the missing-value placeholder
s = pd.Series([2, -999, 4, -999], index=['A', 'B', 'C', 'D'])

# Replace the -999 values with NA
s = s.mask(s == -999)
s

A    2.0
B    NaN
C    4.0
D    NaN
dtype: float64
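
Note that the dtype changed from integer to float64, since NaN is a floating-point value. As a sketch of two alternatives to `mask()` that should give the same result here, we can use `where()` with the opposite condition or `replace()`. The names `masked`, `kept`, and `replaced` are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, -999, 4, -999], index=['A', 'B', 'C', 'D'])

# mask() replaces values where the condition is True
masked = s.mask(s == -999)

# where() keeps values where the condition is True and replaces the rest
kept = s.where(s != -999)

# replace() swaps one value for another
replaced = s.replace(-999, np.nan)

# Check that mask() and where() agree
print(masked.equals(kept))
print(replaced)
```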

### Data frames

In [20]:
# Initialize dictionary with columns' data 
d = {'col_name_1' : pd.Series(np.arange(3)),
     'col_name_2' : pd.Series([3.1, 3.2, 3.3]),
     }

# Create data frame
df = pd.DataFrame(d)
df

   col_name_1  col_name_2
0           0         3.1
1           1         3.2
2           2         3.3
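
To make the relation between the two objects concrete, the sketch below selects a single column of the data frame and checks that it is a pandas.Series sharing the data frame's index. The name `col` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name_1': pd.Series(np.arange(3)),
                   'col_name_2': pd.Series([3.1, 3.2, 3.3])})

# Selecting one column returns a pandas.Series
col = df['col_name_1']
print(type(col))

# The column shares its index with the data frame
print(col.index.equals(df.index))

# Each column keeps its own dtype
print(df.dtypes)
```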


In [21]:
# Change index
df.index = ['a','b','c']
df

   col_name_1  col_name_2
a           0         3.1
b           1         3.2
c           2         3.3


We can access the data frame’s column names via the columns attribute. Update the column names to C1 and C2 by updating this attribute.
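
One possible way to approach this is sketched below. It works on a separate, hypothetical data frame named `df_example`, so that `df` keeps the column names used in the rest of the lesson.

```python
import pandas as pd

# A stand-in data frame with the same structure as df
df_example = pd.DataFrame({'col_name_1': [0, 1, 2],
                           'col_name_2': [3.1, 3.2, 3.3]},
                          index=['a', 'b', 'c'])

# Access the column names
print(df_example.columns)

# Update the column names by assigning a list of the same length
df_example.columns = ['C1', 'C2']
print(df_example)
```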

In [30]:
# Update the values in the last row (label 'c') using label-based indexing.
# .loc avoids the chained assignment df['col'][2] = ..., which raises a SettingWithCopyWarning
# (see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy)
df.loc['c', 'col_name_1'] = 1
df.loc['c', 'col_name_2'] = 3.9
df


   col_name_1  col_name_2
a           0         3.1
b           1         3.2
c           1         3.9
