# Intro to pandas

pandas is a high-level data manipulation library originally developed by Wes McKinney. It is built on the NumPy package and offers data structures and operations for manipulating numerical tables and time series. pandas lets us import data from various sources, such as comma-separated values (CSV) files, JSON, SQL databases, and Microsoft Excel. Throughout the course, we'll take advantage of pandas' data manipulation operations, such as merging, reshaping, and selecting, as well as its data cleaning and data wrangling features. Just like NumPy, pandas is optimized for performance, with critical code paths written in Cython or C.

Let's begin by importing pandas and learning about the Series data type. If for some reason you don't have pandas installed, you will first have to go to your terminal (or Anaconda Prompt if on Windows) and enter the following:

conda install pandas

In [195]:
# Importing pandas and numpy
import pandas as pd
import numpy as np

# Series
A Series is a one-dimensional labeled array. This means we can access (index) elements in the array using their assigned labels. We create a Series using pd.Series(data, index), where data is our array and index holds the corresponding labels. If index is omitted, the labels default to the integers 0 to n-1. Let's see an example below:

In [196]:
# Example code from pdf
# Create a series with pd.Series(data, index)
my_series = pd.Series(data = [1, 2, 3], index = ['A', 'B', 'C'])
print(my_series)
print(type(my_series))

A    1
B    2
C    3
dtype: int64
<class 'pandas.core.series.Series'>


In [197]:
# My example code
# Create a series with pd.Series(data, index)
series = pd.Series(data = [7, 8, 9], index = ['X', 'Y', 'Z'])
print(series)
print(type(series))

X    7
Y    8
Z    9
dtype: int64
<class 'pandas.core.series.Series'>


In [198]:
# Example code from pdf
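# 2 ways of accessing the first element: by position (.iloc) and by label
# (plain my_series[0] on a string-labeled Series is deprecated in newer pandas)
print(f"Accessing first element using my_series.iloc[0]: {my_series.iloc[0]}")
print(f"Accessing first element using my_series['A']: {my_series['A']}")

Accessing first element using my_series.iloc[0]: 1
Accessing first element using my_series['A']: 1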


In [199]:
# My example code
# 2 ways of accessing the first element: by position (.iloc) and by label
print(f"Accessing first element using series.iloc[0]: {series.iloc[0]}")
print(f"Accessing first element using series['X']: {series['X']}")

Accessing first element using series.iloc[0]: 7
Accessing first element using series['X']: 7


In [200]:
# Example code from pdf
# Pass a dictionary to data
my_series = pd.Series(data = {'A': 1, 'B': 2, 'C': 3})
print(my_series)
print(type(my_series))

A    1
B    2
C    3
dtype: int64
<class 'pandas.core.series.Series'>


In [201]:
# My example code
# Pass a dictionary to data
series = pd.Series(data = {'X': 7, 'Y': 8, 'Z': 9})
print(series)
print(type(series))

X    7
Y    8
Z    9
dtype: int64
<class 'pandas.core.series.Series'>


In [202]:
# Example code from pdf
# Using numpy array
my_series = pd.Series(data = np.array([1, 2, 3]), index = ['A', 'B', 'C'])
print(my_series)
print(type(my_series))

A    1
B    2
C    3
dtype: int32
<class 'pandas.core.series.Series'>


In [203]:
# My example code
# Using numpy array
series = pd.Series(data = np.array([7, 8, 9]), index = ['X', 'Y', 'Z'])
print(series)
print(type(series))

X    7
Y    8
Z    9
dtype: int32
<class 'pandas.core.series.Series'>


In [204]:
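# Example code from pdf
# Without an index argument, the labels default to 0..n-1
# (a Series can also be built from another Series, as the outer call does here)
my_series = pd.Series(pd.Series(data = [10, 20, 30]))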
print(my_series)
print(type(my_series))

0    10
1    20
2    30
dtype: int64
<class 'pandas.core.series.Series'>


In [205]:
# My example code
# Without an index argument, the labels default to 0..n-1
series = pd.Series(pd.Series(data = [70, 80, 90]))
print(series)
print(type(series))

0    70
1    80
2    90
dtype: int64
<class 'pandas.core.series.Series'>


In [206]:
# Example code from pdf
# First series
week_one = pd.Series(data = [100, 50, 300], index = ['Bob', 'Sally', 'Jess'])
week_one

Bob      100
Sally     50
Jess     300
dtype: int64

In [207]:
# My example code
# First series
first_week = pd.Series(data = [70, 200, 150], index = ['Matthew', 'Mark', 'Luke'])
first_week

Matthew     70
Mark       200
Luke       150
dtype: int64

In [208]:
# Example code from pdf
# Second series
week_two = pd.Series(data = [500, 30, 20], index = ['Bob', 'Sally', 'Jess'])
week_two

Bob      500
Sally     30
Jess      20
dtype: int64

In [209]:
# My example code
# Second series
second_week = pd.Series(data = [300, 50, 90], index = ['Matthew', 'Mark', 'Luke'])
second_week

Matthew    300
Mark        50
Luke        90
dtype: int64

In [210]:
# Example code from pdf
# Add two Series to get a new Series (values are matched by index label)
total_due = week_one + week_two
total_due

Bob      600
Sally     80
Jess     320
dtype: int64

In [211]:
# My example code
# Add two Series to get a new Series (values are matched by index label)
total = first_week + second_week
total

Matthew    370
Mark       250
Luke       240
dtype: int64
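
The additions above work element by element because both Series carry the same labels. As a minimal sketch of what happens when the labels differ (the names and amounts below are made up for illustration), pandas aligns on the labels rather than on position:

```python
import pandas as pd

# Hypothetical data: Jess is missing from the second week and Ann is new.
week_a = pd.Series(data=[100, 50, 300], index=['Bob', 'Sally', 'Jess'])
week_b = pd.Series(data=[30, 500, 70], index=['Sally', 'Bob', 'Ann'])

# Values are matched by label, not by position.
aligned_total = week_a + week_b

# Labels present in only one Series produce NaN with the + operator;
# Series.add with fill_value=0 treats the missing side as 0 instead.
filled_total = week_a.add(week_b, fill_value=0)

print(aligned_total)
print(filled_total)
```

Because NaN is a floating-point value, the result of the plain addition is stored as float64 even though both inputs hold integers.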

# DataFrames
Formally, a DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It's probably easiest to think of DataFrames as many Series placed next to each other to share a common index label, but you can also think of them as spreadsheets, SQL tables, or a dictionary of Series objects.
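
The "dictionary of Series" view can be sketched directly. The column names and values here are hypothetical and chosen only to show two columns of different types sharing one index:

```python
import pandas as pd

# Two hypothetical Series that share the same index labels.
heights = pd.Series(data=[1.6, 1.8, 1.7], index=['A', 'B', 'C'])
names = pd.Series(data=['Ann', 'Ben', 'Cat'], index=['A', 'B', 'C'])

# Each dictionary key becomes a column label; the shared index becomes the row labels.
people_df = pd.DataFrame({'height': heights, 'name': names})

print(people_df)
print(people_df.dtypes)
```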

We create a DataFrame using pd.DataFrame(data, index, columns). The data and index parameters are familiar now that we've seen Series, but what is the new columns parameter? If you think of a DataFrame as many Series placed next to each other and sharing a common index, we need some way of accessing each individual Series. This is where columns comes in: you can think of it as an additional set of labels, one for each individual Series/column. Let's see an example below.

In [212]:
# Example code from pdf
# Create DataFrame
my_df = pd.DataFrame(data = np.arange(0,20).reshape(4,5), index = ['A', 'B', 'C', 'D'], 
                     columns = ['col1', 'col2', 'col3', 'col4', 'col5'])
print(my_df)
print(type(my_df))

   col1  col2  col3  col4  col5
A     0     1     2     3     4
B     5     6     7     8     9
C    10    11    12    13    14
D    15    16    17    18    19
<class 'pandas.core.frame.DataFrame'>


In [213]:
# My example code
# Create DataFrame
dataF = pd.DataFrame(data = np.arange(0,12).reshape(3,4), index = ['X', 'Y', 'Z'], 
                     columns = ['c1', 'c2', 'c3', 'c4'])
print(dataF)
print(type(dataF))

   c1  c2  c3  c4
X   0   1   2   3
Y   4   5   6   7
Z   8   9  10  11
<class 'pandas.core.frame.DataFrame'>


In [214]:
# Example code from pdf
print(my_df['col2'])
print(type(my_df['col2']))

A     1
B     6
C    11
D    16
Name: col2, dtype: int32
<class 'pandas.core.series.Series'>


In [215]:
# My example code
print(dataF['c2'])
print(type(dataF['c2']))

X    1
Y    5
Z    9
Name: c2, dtype: int32
<class 'pandas.core.series.Series'>


In [216]:
# Example code from pdf
# DataFrame index labels will also default to [0, n)
my_df = pd.DataFrame(data = np.arange(0,20).reshape(4,5), columns = ['col1', 'col2', 'col3', 'col4', 'col5'])
my_df

Unnamed: 0,col1,col2,col3,col4,col5
0,0,1,2,3,4
1,5,6,7,8,9
2,10,11,12,13,14
3,15,16,17,18,19


In [217]:
# My example code
# DataFrame index labels will also default to [0, n)
dataF = pd.DataFrame(data = np.arange(0,12).reshape(3,4), columns = ['c1', 'c2', 'c3', 'c4'])
dataF

Unnamed: 0,c1,c2,c3,c4
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


# DataFrames: Selection

In [218]:
# Example code from pdf
# Insert index labels
my_df = pd.DataFrame(data = np.arange(0,20).reshape(4,5), index = ['A', 'B', 'C', 'D'],
                    columns = ['col1', 'col2', 'col3', 'col4', 'col5'])
my_df

Unnamed: 0,col1,col2,col3,col4,col5
A,0,1,2,3,4
B,5,6,7,8,9
C,10,11,12,13,14
D,15,16,17,18,19


In [219]:
# My example code
# Insert index labels
dataF = pd.DataFrame(data = np.arange(0,12).reshape(3,4), index = ['X', 'Y', 'Z'],
                    columns = ['c1', 'c2', 'c3', 'c4'])
dataF

Unnamed: 0,c1,c2,c3,c4
X,0,1,2,3
Y,4,5,6,7
Z,8,9,10,11


In [220]:
# Example code from pdf
# Access individual columns
my_df['col2']

A     1
B     6
C    11
D    16
Name: col2, dtype: int32

In [221]:
# My example code
# Access individual columns
dataF['c2']

X    1
Y    5
Z    9
Name: c2, dtype: int32

In [222]:
# Example code from pdf
# Access multiple columns
my_df[['col2', 'col3']]

Unnamed: 0,col2,col3
A,1,2
B,6,7
C,11,12
D,16,17


In [223]:
# My example code
# Access multiple columns
dataF[['c2', 'c3']]

Unnamed: 0,c2,c3
X,1,2
Y,5,6
Z,9,10


In [224]:
# Example code from pdf
# Access row info by using .iloc
my_df.iloc[0]

col1    0
col2    1
col3    2
col4    3
col5    4
Name: A, dtype: int32

In [225]:
# My example code
# Access row info by using .iloc
dataF.iloc[0]

c1    0
c2    1
c3    2
c4    3
Name: X, dtype: int32

In [226]:
# Example code from pdf
# Access row info by using .loc
my_df.loc['A']

col1    0
col2    1
col3    2
col4    3
col5    4
Name: A, dtype: int32

In [227]:
# My example code
# Access row info by using .loc
dataF.loc['X']

c1    0
c2    1
c3    2
c4    3
Name: X, dtype: int32

In [228]:
# Example code from pdf
# We can grab a section of rows with .iloc[start_position:stop_position] (stop is excluded)
my_df.iloc[0:3]

Unnamed: 0,col1,col2,col3,col4,col5
A,0,1,2,3,4
B,5,6,7,8,9
C,10,11,12,13,14


In [229]:
# My example code
# We can grab a section of rows with .iloc[start_position:stop_position] (stop is excluded)
dataF.iloc[1:3]

Unnamed: 0,c1,c2,c3,c4
Y,4,5,6,7
Z,8,9,10,11


In [230]:
# Example code from pdf
# We can grab a section of rows with .loc[start_label:stop_label] (stop is included)
my_df.loc['A':'C']

Unnamed: 0,col1,col2,col3,col4,col5
A,0,1,2,3,4
B,5,6,7,8,9
C,10,11,12,13,14


In [231]:
# My example code
# We can grab a section of rows with .loc[start_label:stop_label] (stop is included)
dataF.loc['Y':'Z']

Unnamed: 0,c1,c2,c3,c4
Y,4,5,6,7
Z,8,9,10,11


In [232]:
# Example code from pdf
my_df.iloc[1:4, 0:3]

Unnamed: 0,col1,col2,col3
B,5,6,7
C,10,11,12
D,15,16,17


In [233]:
# My example code
dataF.iloc[1:3, 0:3]

Unnamed: 0,c1,c2,c3
Y,4,5,6
Z,8,9,10


In [234]:
# Example code from pdf
my_df.loc['B':'D', 'col1':'col3']

Unnamed: 0,col1,col2,col3
B,5,6,7
C,10,11,12
D,15,16,17


In [235]:
# My example code
dataF.loc['Y':'Z', 'c1':'c3']

Unnamed: 0,c1,c2,c3
Y,4,5,6
Z,8,9,10
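
The .iloc and .loc cells above return the same sections even though their slices look different. A small sketch of why, rebuilding the same 3x4 frame under the hypothetical name demo_df so it runs on its own:

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame(data=np.arange(0, 12).reshape(3, 4), index=['X', 'Y', 'Z'],
                       columns=['c1', 'c2', 'c3', 'c4'])

# .iloc slices by integer position and excludes the stop position (like Python lists).
by_position = demo_df.iloc[0:2, 0:2]

# .loc slices by label and includes the stop label.
by_label = demo_df.loc['X':'Y', 'c1':'c2']

print(by_position)
print(by_label)
print(by_position.equals(by_label))
```

Both selections cover rows X and Y and columns c1 and c2, which is why a position slice ending at 2 and a label slice ending at 'Y' / 'c2' agree.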


In [236]:
# Example code from pdf
my_df = pd.DataFrame(data=np.arange(0,20).reshape(4,5), index=['A', 'B', 'C', 'D'],
                    columns=['col1', 'col2', 'col3', 'col4', 'col5'])
my_df

Unnamed: 0,col1,col2,col3,col4,col5
A,0,1,2,3,4
B,5,6,7,8,9
C,10,11,12,13,14
D,15,16,17,18,19


In [237]:
# My example code
dataF = pd.DataFrame(data = np.arange(0,12).reshape(3,4), index = ['X', 'Y', 'Z'],
                    columns = ['c1', 'c2', 'c3', 'c4'])
dataF

Unnamed: 0,c1,c2,c3,c4
X,0,1,2,3
Y,4,5,6,7
Z,8,9,10,11


In [238]:
# Example code from pdf
# Using a comparison operator returns a DataFrame of booleans
my_df % 2 == 0

Unnamed: 0,col1,col2,col3,col4,col5
A,True,False,True,False,True
B,False,True,False,True,False
C,True,False,True,False,True
D,False,True,False,True,False


In [239]:
# My example code
# Using a comparison operator returns a DataFrame of booleans
dataF % 2 == 0

Unnamed: 0,c1,c2,c3,c4
X,True,False,True,False
Y,True,False,True,False
Z,True,False,True,False


In [240]:
# Example code from pdf
# Indexing with the boolean DataFrame keeps values where it is True and shows NaN elsewhere
my_df[my_df % 2 == 0]

Unnamed: 0,col1,col2,col3,col4,col5
A,0.0,NaN,2.0,NaN,4.0
B,NaN,6.0,NaN,8.0,NaN
C,10.0,NaN,12.0,NaN,14.0
D,NaN,16.0,NaN,18.0,NaN


In [241]:
# My example code
# Indexing with the boolean DataFrame keeps values where it is True and shows NaN elsewhere
dataF[dataF % 2 == 0]

Unnamed: 0,c1,c2,c3,c4
X,0,NaN,2,NaN
Y,4,NaN,6,NaN
Z,8,NaN,10,NaN


In [242]:
# Example code from pdf
# filling all NaN values with 0
my_df[my_df % 2 == 0].fillna(value = 0)

Unnamed: 0,col1,col2,col3,col4,col5
A,0.0,0.0,2.0,0.0,4.0
B,0.0,6.0,0.0,8.0,0.0
C,10.0,0.0,12.0,0.0,14.0
D,0.0,16.0,0.0,18.0,0.0


In [243]:
# My example code
# filling all NaN values with 0
dataF[dataF % 2 == 0].fillna(value = 0)

Unnamed: 0,c1,c2,c3,c4
X,0,0.0,2,0.0
Y,4,0.0,6,0.0
Z,8,0.0,10,0.0


In [244]:
# Example code from pdf
# filling all NaN values with whatever the mean of my_df's original col2 is: (1+6+11+16)/4 = 8.5
my_df[my_df % 2 == 0].fillna(value = my_df['col2'].mean())

Unnamed: 0,col1,col2,col3,col4,col5
A,0.0,8.5,2.0,8.5,4.0
B,8.5,6.0,8.5,8.0,8.5
C,10.0,8.5,12.0,8.5,14.0
D,8.5,16.0,8.5,18.0,8.5


In [245]:
# My example code
# filling all NaN values with whatever the mean of dataF's original c2 is: (1+5+9)/3 = 5.0
dataF[dataF % 2 == 0].fillna(value = dataF['c2'].mean())

Unnamed: 0,c1,c2,c3,c4
X,0,5.0,2,5.0
Y,4,5.0,6,5.0
Z,8,5.0,10,5.0
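
The cells above fill every gap with one scalar. fillna also accepts a Series keyed by column label, so each column can be filled with its own value. A minimal sketch, rebuilding the 4x5 frame under the hypothetical name demo_df:

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame(data=np.arange(0, 20).reshape(4, 5), index=['A', 'B', 'C', 'D'],
                       columns=['col1', 'col2', 'col3', 'col4', 'col5'])

# Keep even values only; odd values become NaN.
masked = demo_df[demo_df % 2 == 0]

# masked.mean() is a Series of per-column means (NaN is skipped by default),
# so each column's gaps are filled with the mean of that column's remaining values.
filled = masked.fillna(value=masked.mean())

print(filled)
```

Note that this uses the means of the masked columns, not of the original ones; for col1 that is (0 + 10) / 2 = 5.0.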


# DataFrames: Adding and Dropping Columns

In [246]:
# Example code from pdf
my_df

Unnamed: 0,col1,col2,col3,col4,col5
A,0,1,2,3,4
B,5,6,7,8,9
C,10,11,12,13,14
D,15,16,17,18,19


In [247]:
# My example code
dataF

Unnamed: 0,c1,c2,c3,c4
X,0,1,2,3
Y,4,5,6,7
Z,8,9,10,11


In [248]:
# Example code from pdf
# Add new column
my_df['newCol'] = [10, 20, 30, 40]
my_df

Unnamed: 0,col1,col2,col3,col4,col5,newCol
A,0,1,2,3,4,10
B,5,6,7,8,9,20
C,10,11,12,13,14,30
D,15,16,17,18,19,40


In [249]:
# My example code
# Add new column
dataF['newC'] = [50, 60, 70]
dataF

Unnamed: 0,c1,c2,c3,c4,newC
X,0,1,2,3,50
Y,4,5,6,7,60
Z,8,9,10,11,70


In [250]:
# Example code from pdf
# Add new column with sum of col1 and col2
my_df['col1+col2'] = my_df['col1'] + my_df['col2']
my_df

Unnamed: 0,col1,col2,col3,col4,col5,newCol,col1+col2
A,0,1,2,3,4,10,1
B,5,6,7,8,9,20,11
C,10,11,12,13,14,30,21
D,15,16,17,18,19,40,31


In [251]:
# My example code
# Add new column with sum of c1 and c2
dataF['c1+c2'] = dataF['c1'] + dataF['c2']
dataF

Unnamed: 0,c1,c2,c3,c4,newC,c1+c2
X,0,1,2,3,50,1
Y,4,5,6,7,60,9
Z,8,9,10,11,70,17


In [252]:
# Example code from pdf
# Drop column newCol
my_df.drop(columns=['newCol'])

Unnamed: 0,col1,col2,col3,col4,col5,col1+col2
A,0,1,2,3,4,1
B,5,6,7,8,9,11
C,10,11,12,13,14,21
D,15,16,17,18,19,31


In [253]:
# My example code
# Drop column newC
dataF.drop(columns=['newC'])

Unnamed: 0,c1,c2,c3,c4,c1+c2
X,0,1,2,3,1
Y,4,5,6,7,9
Z,8,9,10,11,17


In [254]:
# Example code from pdf
# drop() returned a new DataFrame; my_df itself is unchanged
my_df

Unnamed: 0,col1,col2,col3,col4,col5,newCol,col1+col2
A,0,1,2,3,4,10,1
B,5,6,7,8,9,20,11
C,10,11,12,13,14,30,21
D,15,16,17,18,19,40,31


In [255]:
# My example code
# drop() returned a new DataFrame; dataF itself is unchanged
dataF

Unnamed: 0,c1,c2,c3,c4,newC,c1+c2
X,0,1,2,3,50,1
Y,4,5,6,7,60,9
Z,8,9,10,11,70,17


In [256]:
# Example code from pdf
# To modify my_df itself, pass the parameter inplace = True
my_df.drop(columns = ['newCol', 'col1+col2'], inplace = True)

In [257]:
# My example code
# To modify dataF itself, pass the parameter inplace = True
dataF.drop(columns = ['newC', 'c1+c2'], inplace = True)

In [258]:
# Example code from pdf
# Check to see if changes are saved
my_df

Unnamed: 0,col1,col2,col3,col4,col5
A,0,1,2,3,4
B,5,6,7,8,9
C,10,11,12,13,14
D,15,16,17,18,19


In [259]:
# My example code
# Check to see if changes are saved
dataF

Unnamed: 0,c1,c2,c3,c4
X,0,1,2,3
Y,4,5,6,7
Z,8,9,10,11
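
An alternative to inplace = True is to assign the result of drop() back to a name. A short sketch with a hypothetical frame demo_df:

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame(data=np.arange(0, 12).reshape(3, 4), index=['X', 'Y', 'Z'],
                       columns=['c1', 'c2', 'c3', 'c4'])

# drop() returns a new DataFrame and leaves demo_df untouched.
trimmed = demo_df.drop(columns=['c4'])
print(demo_df.columns.tolist())
print(trimmed.columns.tolist())

# Reassigning the name has the same visible effect as inplace=True.
demo_df = demo_df.drop(columns=['c4'])
print(demo_df.columns.tolist())
```

Reassignment keeps the original available under another name if needed, which can make a notebook easier to re-run cell by cell.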


# DataFrames: Groupby and Common Operations

In [260]:
# Example code from pdf
my_df = pd.DataFrame({'Type': ['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Cat', 'Cat', 'Cat'],
                     'Max Speed': [380., 370., 24., 26., 50., 50., 150.]})
my_df

Unnamed: 0,Type,Max Speed
0,Falcon,380.0
1,Falcon,370.0
2,Parrot,24.0
3,Parrot,26.0
4,Cat,50.0
5,Cat,50.0
6,Cat,150.0


In [261]:
# My example code
dataF = pd.DataFrame({'Type': ['Eagle', 'Eagle', 'Dove', 'Dove', 'Dog', 'Dog', 'Dog'],
                     'Max Speed': [400., 350., 30., 25., 70., 70., 100.]})
dataF

Unnamed: 0,Type,Max Speed
0,Eagle,400.0
1,Eagle,350.0
2,Dove,30.0
3,Dove,25.0
4,Dog,70.0
5,Dog,70.0
6,Dog,100.0


In [262]:
# Example code from pdf
# Calling unique() on columns
print(f"unique types: {my_df['Type'].unique()}")
print(f"unique max speeds: {my_df['Max Speed'].unique()}")

unique types: ['Falcon' 'Parrot' 'Cat']
unique max speeds: [380. 370.  24.  26.  50. 150.]


In [263]:
# My example code
# Calling unique() on columns
print(f"unique types: {dataF['Type'].unique()}")
print(f"unique max speeds: {dataF['Max Speed'].unique()}")

unique types: ['Eagle' 'Dove' 'Dog']
unique max speeds: [400. 350.  30.  25.  70. 100.]


In [264]:
# Example code from pdf
# Unique counts on a column
my_df['Type'].value_counts()

Cat       3
Falcon    2
Parrot    2
Name: Type, dtype: int64

In [265]:
# My example code
# Unique counts on a column
dataF['Type'].value_counts()

Dog      3
Eagle    2
Dove     2
Name: Type, dtype: int64

In [266]:
# Example code from pdf
# Get sum, mean, min, max from a column
print(f"sum of Max Speed col: {my_df['Max Speed'].sum()}")
print(f"mean of Max Speed col: {my_df['Max Speed'].mean()}")
print(f"min from Max Speed col: {my_df['Max Speed'].min()}")
print(f"max from Max Speed col: {my_df['Max Speed'].max()}")

sum of Max Speed col: 1050.0
mean of Max Speed col: 150.0
min from Max Speed col: 24.0
max from Max Speed col: 380.0


In [267]:
# My example code
# Get sum, mean, min, max from a column
print(f"The sum of Max Speed col: {dataF['Max Speed'].sum()}")
print(f"The mean of Max Speed col: {dataF['Max Speed'].mean()}")
print(f"The min from Max Speed col: {dataF['Max Speed'].min()}")
print(f"The max from Max Speed col: {dataF['Max Speed'].max()}")

The sum of Max Speed col: 1045.0
The mean of Max Speed col: 149.28571428571428
The min from Max Speed col: 25.0
The max from Max Speed col: 400.0


In [268]:
# Example code from pdf
# This groups all cats together, all falcons together, and all parrots together, and then applies the mean
# function to each group's columns (only Max Speed in this case). So we can see that the mean Max Speed
# for all cats in the DataFrame is 83.333333.
my_df.groupby(by='Type').mean()

Type,Max Speed
Cat,83.333333
Falcon,375.0
Parrot,25.0


In [269]:
# My example code
# This groups all dogs together, all doves together, and all eagles together, and then applies the mean
# function to each group's columns (only Max Speed in this case). So we can see that the mean Max Speed
# for all dogs in the DataFrame is 80.0.
dataF.groupby(by='Type').mean()

Type,Max Speed
Dog,80.0
Dove,27.5
Eagle,375.0
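
mean() is only one of the aggregations a group can take. As a sketch, the same grouping can report several statistics at once through agg(); the frame is rebuilt here under the hypothetical name animals so the block runs on its own:

```python
import pandas as pd

animals = pd.DataFrame({'Type': ['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Cat', 'Cat', 'Cat'],
                        'Max Speed': [380., 370., 24., 26., 50., 50., 150.]})

# Select the column first, then compute several aggregates per group in one call.
speed_summary = animals.groupby(by='Type')['Max Speed'].agg(['mean', 'min', 'max', 'count'])

print(speed_summary)
```

The result has one row per group and one column per aggregate, so a single value can be read with, for example, speed_summary.loc['Cat', 'max'].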


# Extra Practice
Great! Now let's use what we've learned to explore a real dataset using pandas.

We'll be looking at a pokemon dataset, which contains the following attributes:

    #: ID for each pokemon
    Name: name of each pokemon
    Type 1: each pokemon has a type, which determines its weakness/resistance to attacks
    Type 2: some pokemon are dual type and have a second type
    Total: sum of all stats that come after this; a general guide to how strong a pokemon is
    HP: hit points, or health; defines how much damage a pokemon can withstand before fainting
    Attack: the base modifier for normal attacks (e.g. Scratch, Punch)
    Defense: the base damage resistance against normal attacks
    Sp. Atk: special attack, the base modifier for special attacks (e.g. Fire Blast, Bubble Beam)
    Sp. Def: special defense, the base damage resistance against special attacks
    Speed: determines which pokemon attacks first each round

We'll typically read in an existing dataset from our computer, which pandas converts into a DataFrame for us, rather than creating the whole thing from scratch as we did earlier in this lab. To do this, we'll use pd.read_csv(filepath_or_buffer). If the dataset is located in the same directory as this Jupyter notebook, we can simply pass the file name as the filepath_or_buffer parameter. Otherwise, we'll have to specify the path to the file.
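
The "buffer" half of filepath_or_buffer means read_csv also accepts a file-like object. As a sketch that runs without any file on disk, the text below is a hypothetical excerpt with only a few of the dataset's columns, wrapped in io.StringIO:

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for a CSV file on disk.
csv_text = """#,Name,Type 1,Type 2,HP
1,Bulbasaur,Grass,Poison,45
2,Ivysaur,Grass,Poison,60
4,Charmander,Fire,,39
"""

# read_csv accepts a path or any file-like object; StringIO is the "buffer" case.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df)
print(sample_df.dtypes)
```

The first line of the text becomes the column labels, and the empty Type 2 field for Charmander is read as NaN.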

In [270]:
# Example code from pdf
# Read the pokemon CSV file into a DataFrame
pokemon_df = pd.read_csv(filepath_or_buffer='Pokemon.csv')

In [271]:
# Example code from pdf
# Preview of the data set from pokemon file
pokemon_df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,NaN,309,39,52,43,60,50,65,1,False


In [272]:
# My practical code
# Preview the first 7 rows instead of the default 5
pokemon_df.head(7)

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,NaN,309,39,52,43,60,50,65,1,False
5,5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False
6,6,Charizard,Fire,Flying,534,78,84,78,109,85,100,1,False


In [273]:
# Example code from pdf
# print summary of our pokemon dataframe by using info()
pokemon_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   #           800 non-null    int64 
 1   Name        800 non-null    object
 2   Type 1      800 non-null    object
 3   Type 2      414 non-null    object
 4   Total       800 non-null    int64 
 5   HP          800 non-null    int64 
 6   Attack      800 non-null    int64 
 7   Defense     800 non-null    int64 
 8   Sp. Atk     800 non-null    int64 
 9   Sp. Def     800 non-null    int64 
 10  Speed       800 non-null    int64 
 11  Generation  800 non-null    int64 
 12  Legendary   800 non-null    bool  
dtypes: bool(1), int64(9), object(3)
memory usage: 75.9+ KB


In [274]:
# Example code from pdf
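# Fill missing values in Type 2 with the value from Type 1.
# Assigning the result back avoids calling inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to pokemon_df.
pokemon_df['Type 2'] = pokemon_df['Type 2'].fillna(pokemon_df['Type 1'])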

In [275]:
# Example code from pdf
# Check that Type 2 now has as many non-null values as Type 1
pokemon_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   #           800 non-null    int64 
 1   Name        800 non-null    object
 2   Type 1      800 non-null    object
 3   Type 2      800 non-null    object
 4   Total       800 non-null    int64 
 5   HP          800 non-null    int64 
 6   Attack      800 non-null    int64 
 7   Defense     800 non-null    int64 
 8   Sp. Atk     800 non-null    int64 
 9   Sp. Def     800 non-null    int64 
 10  Speed       800 non-null    int64 
 11  Generation  800 non-null    int64 
 12  Legendary   800 non-null    bool  
dtypes: bool(1), int64(9), object(3)
memory usage: 75.9+ KB


In [276]:
# Example code from pdf
# Select Legendary pokemon with the boolean mask pokemon_df['Legendary'] == True
pokemon_df[pokemon_df['Legendary'] == True]

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
156,144,Articuno,Ice,Flying,580,90,85,100,95,125,85,1,True
157,145,Zapdos,Electric,Flying,580,90,90,85,125,90,100,1,True
158,146,Moltres,Fire,Flying,580,90,100,90,125,85,90,1,True
162,150,Mewtwo,Psychic,Psychic,680,106,110,90,154,90,130,1,True
163,150,MewtwoMega Mewtwo X,Psychic,Fighting,780,106,190,100,154,100,130,1,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


In [277]:
# My example code
# Select non-Legendary pokemon with the boolean mask pokemon_df['Legendary'] == False
pokemon_df[pokemon_df['Legendary'] == False]

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,Fire,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
787,711,GourgeistSuper Size,Ghost,Grass,494,85,100,122,58,75,54,6,False
788,712,Bergmite,Ice,Ice,304,55,69,85,32,35,28,6,False
789,713,Avalugg,Ice,Ice,514,95,117,184,44,46,28,6,False
790,714,Noibat,Flying,Dragon,245,40,30,35,45,40,55,6,False


In [278]:
# Example code from pdf
# See how many Legendary pokemon
pokemon_df[pokemon_df['Legendary'] == True].shape

(65, 13)

In [279]:
# My example code
# See how many non-Legendary pokemon
pokemon_df[pokemon_df['Legendary'] == False].shape

(735, 13)

In [280]:
# Example code from pdf
# See how many fire pokemon
pokemon_df[pokemon_df['Type 1'] == 'Fire'].shape

(52, 13)

In [281]:
# My example code
# See how many water pokemon
pokemon_df[pokemon_df['Type 1'] == 'Water'].shape

(112, 13)
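
The counts above come from filtering and then reading .shape. Two related idioms, sketched on a hypothetical four-row miniature of the dataset (mini_df and its rows are made up for illustration): summing a boolean mask counts its True values, and masks can be combined to filter on more than one condition.

```python
import pandas as pd

# Hypothetical miniature of the dataset; only the columns needed here.
mini_df = pd.DataFrame({'Name': ['Articuno', 'Charmander', 'Squirtle', 'Moltres'],
                        'Type 1': ['Ice', 'Fire', 'Water', 'Fire'],
                        'Legendary': [True, False, False, True]})

# True counts as 1, so summing a boolean mask counts the matching rows.
n_legendary = mini_df['Legendary'].sum()
n_fire = (mini_df['Type 1'] == 'Fire').sum()

# Masks combine with & (and), | (or), ~ (not); wrap each comparison in parentheses.
legendary_fire = mini_df[(mini_df['Type 1'] == 'Fire') & mini_df['Legendary']]

print(n_legendary, n_fire)
print(legendary_fire)
```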

In [282]:
# Example code from pdf
# Get the index label of the row with the max HP
pokemon_df['HP'].idxmax()

261

In [283]:
# My example code
# Get the index label of the row with the max Speed
pokemon_df['Speed'].idxmax()

431

In [284]:
# Example code from pdf
# Get info from the max HP pokemon
# (idxmax() returns an index label, so .loc is the matching accessor)
pokemon_df.loc[pokemon_df['HP'].idxmax()]

#                 242
Name          Blissey
Type 1         Normal
Type 2         Normal
Total             540
HP                255
Attack             10
Defense            10
Sp. Atk            75
Sp. Def           135
Speed              55
Generation          2
Legendary       False
Name: 261, dtype: object

In [285]:
# My example code
# Get info from the max Speed pokemon
# (idxmax() returns an index label, so .loc is the matching accessor)
pokemon_df.loc[pokemon_df['Speed'].idxmax()]

#                           386
Name          DeoxysSpeed Forme
Type 1                  Psychic
Type 2                  Psychic
Total                       600
HP                           50
Attack                       95
Defense                      90
Sp. Atk                      95
Sp. Def                      90
Speed                       180
Generation                    3
Legendary                  True
Name: 431, dtype: object

In [286]:
# Example code from pdf
# Sort values and head(1)
pokemon_df.sort_values(by='HP', ascending=False).head(1)

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
261,242,Blissey,Normal,Normal,540,255,10,10,75,135,55,2,False


In [287]:
# My example code
# Sort values and head(2)
pokemon_df.sort_values(by='Speed', ascending=False).head(2)

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
431,386,DeoxysSpeed Forme,Psychic,Psychic,600,50,95,90,95,90,180,3,True
315,291,Ninjask,Bug,Flying,456,61,90,45,50,50,160,3,False
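
Sorting the whole frame and taking head(n) works, and pandas also offers nlargest for the same "top n" question. A sketch on a hypothetical miniature frame (mini_df) that reuses a few HP and Speed values from the outputs above:

```python
import pandas as pd

# Hypothetical miniature of the dataset, using values shown in the outputs above.
mini_df = pd.DataFrame({'Name': ['Blissey', 'Ninjask', 'DeoxysSpeed Forme', 'Bulbasaur'],
                        'HP': [255, 61, 50, 45],
                        'Speed': [55, 160, 180, 45]})

# nlargest(n, column) returns the n rows with the largest values, already sorted.
fastest_two = mini_df.nlargest(2, 'Speed')

# The same rows via the sort-then-head approach.
fastest_two_sorted = mini_df.sort_values(by='Speed', ascending=False).head(2)

print(fastest_two)
```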


In [288]:
# Example code from pdf
# Count values from Type 1 column
pokemon_df['Type 1'].value_counts()

Water       112
Normal       98
Grass        70
Bug          69
Psychic      57
Fire         52
Electric     44
Rock         44
Dragon       32
Ground       32
Ghost        32
Dark         31
Poison       28
Steel        27
Fighting     27
Ice          24
Fairy        17
Flying        4
Name: Type 1, dtype: int64

In [289]:
# My example code
# Count values from Type 2 column
pokemon_df['Type 2'].value_counts()

Flying      99
Water       73
Psychic     71
Normal      65
Grass       58
Poison      49
Ground      48
Fighting    46
Fire        40
Fairy       38
Electric    33
Dark        30
Dragon      29
Steel       27
Ice         27
Ghost       24
Rock        23
Bug         20
Name: Type 2, dtype: int64

In [290]:
# Example code from pdf
# Show mean HP for each Type 1 group
# (select the HP column before aggregating so non-numeric columns are not averaged)
pokemon_df.groupby(by='Type 1')['HP'].mean()

Type 1
Bug         56.884058
Dark        66.806452
Dragon      83.312500
Electric    59.795455
Fairy       74.117647
Fighting    69.851852
Fire        69.903846
Flying      70.750000
Ghost       64.437500
Grass       67.271429
Ground      73.781250
Ice         72.000000
Normal      77.275510
Poison      67.250000
Psychic     70.631579
Rock        65.363636
Steel       65.222222
Water       72.062500
Name: HP, dtype: float64

In [291]:
# My example code
# Show mean Speed for each Type 1 group
# (select the Speed column before aggregating so non-numeric columns are not averaged)
pokemon_df.groupby(by='Type 1')['Speed'].mean()

Type 1
Bug          61.681159
Dark         76.161290
Dragon       83.031250
Electric     84.500000
Fairy        48.588235
Fighting     66.074074
Fire         74.442308
Flying      102.500000
Ghost        64.343750
Grass        61.928571
Ground       63.906250
Ice          63.458333
Normal       71.551020
Poison       63.571429
Psychic      81.491228
Rock         55.909091
Steel        55.259259
Water        65.964286
Name: Speed, dtype: float64