Learning Objectives


• Demonstrate supervised learning algorithms

• Explain key concepts like under- and over-fitting, 
regularization, and cross-validation

• Classify the type of problem to be solved, choose the 
right algorithm, tune parameters, and validate a model

• Apply Intel® Extension for Scikit-learn* to leverage 
underlying compute capabilities of hardware


Our Toolset: Intel® oneAPI AI Analytics Toolkit (AI Kit)

• Jupyter notebooks: interactive coding and 
visualization of output
• NumPy, SciPy, Pandas: numerical computation
• Matplotlib, Seaborn: data visualization
• Scikit-learn: machine learning

Introduction to Pandas
• Library for computation with tabular data
• Mixed types of data allowed in a single 
table
• Columns and rows of data can be named
• Advanced data aggregation and statistical 
functions
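The features listed above can be seen in one small table. This is a minimal sketch; the column names, row labels, and values are made up for illustration and do not come from the course data.

```python
import pandas as pd

# Hypothetical data: one table mixing strings, integers, and floats
df = pd.DataFrame(
    {
        "activity": ["walk", "cycle", "walk", "cycle"],
        "minutes": [30, 45, 20, 60],
        "distance_km": [2.5, 15.0, 1.8, 22.4],
    },
    index=["mon", "tue", "wed", "thu"],  # named rows
)

# Each column keeps its own data type
print(df.dtypes)

# Aggregation: total minutes and mean distance per activity
summary = df.groupby("activity").agg(
    total_minutes=("minutes", "sum"),
    mean_distance_km=("distance_km", "mean"),
)
print(summary)
```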

Introduction to Pandas
Basic data structures
• Vector (1 dimension) = Series
• Array (2 dimensions) = DataFrame
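The difference in dimensionality can be checked directly. A short sketch with made-up values (the names `s` and `df` are illustrative only):

```python
import pandas as pd

s = pd.Series([1, 2, 3], name="a")                   # one-dimensional
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})  # two-dimensional

print(s.ndim, s.shape)    # 1 (3,)
print(df.ndim, df.shape)  # 2 (3, 2)

# Selecting a single column of a DataFrame gives back a Series
column = df["a"]
print(type(column).__name__)  # Series
```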

Pandas Series Creation and Indexing
Use data from step tracking application to create a Pandas Series

In [1]:
import pandas as pd
step_data = [3620, 7891, 9761, 3907, 4338, 5373]
step_counts = pd.Series(step_data, name='steps')
print(step_counts)

0    3620
1    7891
2    9761
3    3907
4    4338
5    5373
Name: steps, dtype: int64


Pandas Series Creation and Indexing
Add a date range to the Series

In [2]:
step_counts.index = pd.date_range('20150329', periods=6)
print(step_counts)

2015-03-29    3620
2015-03-30    7891
2015-03-31    9761
2015-04-01    3907
2015-04-02    4338
2015-04-03    5373
Freq: D, Name: steps, dtype: int64
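`pd.date_range` can describe the same index in more than one way. The sketch below assumes only the start date used above; the variable names are illustrative.

```python
import pandas as pd

# Six consecutive days, daily frequency by default
by_periods = pd.date_range("2015-03-29", periods=6)

# The same six days, given as a start and an end date
by_end = pd.date_range(start="2015-03-29", end="2015-04-03")

# Every second day, using an explicit frequency string
every_other = pd.date_range("2015-03-29", periods=3, freq="2D")

print(by_periods.equals(by_end))                  # True
print(every_other.strftime("%Y-%m-%d").tolist())
# ['2015-03-29', '2015-03-31', '2015-04-02']
```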


Pandas Series Creation and Indexing
Select data by the index values

In [4]:
print(step_counts['2015-04-01'])


3907


Or by index position--like an array


In [5]:
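# Use .iloc for positions; plain [] with an integer is treated as a
# label lookup in recent pandas versions when the index is not integer
print(step_counts.iloc[3])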

3907


Select all of April


In [6]:
print(step_counts['2015-04'])


2015-04-01    3907
2015-04-02    4338
2015-04-03    5373
Freq: D, Name: steps, dtype: int64
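The three kinds of selection above can also be written with the explicit `.loc` (label) and `.iloc` (position) indexers. A self-contained sketch that rebuilds the same step data:

```python
import pandas as pd

step_counts = pd.Series(
    [3620, 7891, 9761, 3907, 4338, 5373],
    index=pd.date_range("20150329", periods=6),
    name="steps",
)

# Label-based lookup of a single day
print(step_counts.loc["2015-04-01"])  # 3907

# Position-based lookup of the same element
print(step_counts.iloc[3])            # 3907

# Label slices include both endpoints
march_end = step_counts.loc["2015-03-30":"2015-03-31"]
print(march_end)
```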


Pandas Data Types and Imputation
Data types can be viewed and converted


View the data type


In [15]:
print(step_counts.dtypes)


int64


Convert to a float

In [16]:
import numpy as np
# np.float was deprecated in NumPy 1.20 and removed in 1.24;
# use the built-in float or np.float64 instead
step_counts = step_counts.astype(np.float64)
print(step_counts.dtypes)


float64
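One reason to convert to a float type before working with missing values: NaN is a floating-point value, so an integer Series cannot hold it. A minimal sketch with a shortened copy of the step data (the name `steps` is illustrative):

```python
import numpy as np
import pandas as pd

steps = pd.Series([3620, 7891, 9761], name="steps")
print(steps.dtype)        # int64

steps_float = steps.astype(np.float64)
print(steps_float.dtype)  # float64

# A float Series can store NaN
steps_float.iloc[1] = np.nan

# Converting back to an integer type fails while NaN is present
try:
    steps_float.astype(np.int64)
    converted = True
except ValueError as err:
    converted = False
    print("cannot convert:", err)
```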


Pandas Data Types and Imputation
Invalid data points can be easily filled with values

Create invalid data

In [17]:
# np.NaN was removed in NumPy 2.0; np.nan works in all versions
step_counts[1:3] = np.nan

Now fill it in with zeros

In [18]:
step_counts = step_counts.fillna(0.)
print(step_counts[1:3])

2015-03-30    0.0
2015-03-31    0.0
Freq: D, Name: steps, dtype: float64
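Zero is only one possible fill value. The sketch below compares it with two common alternatives, the mean of the observed values and a forward fill; which one is appropriate depends on the data, and the choice here is an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

steps = pd.Series(
    [3620.0, np.nan, np.nan, 3907.0, 4338.0, 5373.0],
    index=pd.date_range("20150329", periods=6),
    name="steps",
)

# Count the missing entries before choosing a strategy
print(steps.isna().sum())                 # 2

filled_zero = steps.fillna(0.0)           # constant value
filled_mean = steps.fillna(steps.mean())  # mean of the observed values
filled_prev = steps.ffill()               # carry the last valid value forward

print(filled_mean.iloc[1])                # 4309.5
print(filled_prev.iloc[2])                # 3620.0
```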


Pandas DataFrame Creation and Methods
DataFrames can be created from lists, dictionaries, and Pandas Series

In [22]:
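# Cycling distance (None marks a missing reading)
cycling_data = [10.7, 0, None, 2.4, 15.3, 10.9, 0, None]
# Pair the values into a list of tuples; zip stops at the shorter
# list, so only the first six cycling values are used
joined_data = list(zip(step_data, cycling_data))
# The dataframe
activity_df = pd.DataFrame(joined_data)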
print(activity_df)


      0     1
0  3620  10.7
1  7891   0.0
2  9761   NaN
3  3907   2.4
4  4338  15.3
5  5373  10.9


Labeled columns and an index can be added

In [23]:
# Add column names to dataframe
activity_df = pd.DataFrame(
    joined_data,
    index=pd.date_range('20150329', periods=6),
    columns=['Walking', 'Cycling'])
print(activity_df)

            Walking  Cycling
2015-03-29     3620     10.7
2015-03-30     7891      0.0
2015-03-31     9761      NaN
2015-04-01     3907      2.4
2015-04-02     4338     15.3
2015-04-03     5373     10.9
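The cells above build the DataFrame from a list of tuples. The same kind of table can be built from a dictionary or from Pandas Series; the following is a sketch using the first three rows of the activity data, with illustrative variable names.

```python
import pandas as pd

dates = pd.date_range("20150329", periods=3)

# From a dictionary: keys become column names
df_from_dict = pd.DataFrame(
    {"Walking": [3620, 7891, 9761], "Cycling": [10.7, 0.0, None]},
    index=dates,
)

# From Series: rows are aligned on the shared index,
# and each Series name becomes a column name
walking = pd.Series([3620, 7891, 9761], index=dates, name="Walking")
cycling = pd.Series([10.7, 0.0, None], index=dates, name="Cycling")
df_from_series = pd.concat([walking, cycling], axis=1)

print(df_from_dict)
print(df_from_series)
```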


DataFrame rows can be selected with the 'loc' (index label) and 'iloc' (integer position) indexers


In [24]:
# Select row of data by index name
print(activity_df.loc['2015-04-01'])

Walking    3907.0
Cycling       2.4
Name: 2015-04-01 00:00:00, dtype: float64


In [25]:
# Select row of data by integer position
print(activity_df.iloc[-3])

Walking    3907.0
Cycling       2.4
Name: 2015-04-01 00:00:00, dtype: float64
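`loc` and `iloc` also accept slices, and they treat the end of a slice differently. A self-contained sketch that rebuilds the activity table from a dictionary:

```python
import pandas as pd

activity_df = pd.DataFrame(
    {
        "Walking": [3620, 7891, 9761, 3907, 4338, 5373],
        "Cycling": [10.7, 0.0, None, 2.4, 15.3, 10.9],
    },
    index=pd.date_range("20150329", periods=6),
)

# Label slice: both endpoints are included (3 rows)
by_label = activity_df.loc["2015-03-30":"2015-04-01"]

# Position slice: the stop position is excluded (2 rows)
by_position = activity_df.iloc[1:3]

# Row label and column name together select a single cell
cell = activity_df.loc["2015-04-02", "Cycling"]

print(len(by_label), len(by_position), cell)  # 3 2 15.3
```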


DataFrame columns can be indexed by name


In [26]:
# Name of column
print(activity_df['Walking'])


2015-03-29    3620
2015-03-30    7891
2015-03-31    9761
2015-04-01    3907
2015-04-02    4338
2015-04-03    5373
Freq: D, Name: Walking, dtype: int64


DataFrame columns can also be accessed as attributes


In [27]:
# Attribute access (works when the column name is a valid identifier)
print(activity_df.Walking)

2015-03-29    3620
2015-03-30    7891
2015-03-31    9761
2015-04-01    3907
2015-04-02    4338
2015-04-03    5373
Freq: D, Name: Walking, dtype: int64
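Attribute access is convenient, but it does not work for every column name. The sketch below uses a hypothetical table whose column names were chosen to show two such cases.

```python
import pandas as pd

# Hypothetical columns: one plain name, one with a space,
# and one that matches the name of a DataFrame method
df = pd.DataFrame(
    {"Walking": [3620, 7891], "step count": [1, 2], "mean": [5.0, 6.0]}
)

# Attribute access works for simple names
print(df.Walking.tolist())        # [3620, 7891]

# A name with a space is not a valid attribute; use brackets
print(df["step count"].tolist())  # [1, 2]

# df.mean is the DataFrame method, not the column; brackets reach the column
print(callable(df.mean))          # True
print(df["mean"].tolist())        # [5.0, 6.0]
```

Bracket indexing works for any column name, so it is the safer choice in reusable code.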
