Of the two pre-eminent machine learning paradigms -- supervised and unsupervised -- classification is a popular supervised technique, where examples labeled by humans guide the training of a machine. Below, we introduce classification with a few hands-on examples.
Previously, with BQML, we developed an in-database classification model directly in the BigQuery data warehouse, so continuous training and continuous scoring are opaque to consumers: fully managed and seamless.
-- Jump to https://console.cloud.google.com/bigquery?project=project-dynamic-modeling&p=project-dynamic-modeling
-- and key in the model as follows
CREATE OR REPLACE MODEL
`bqml_tutorial.cardio_logistic_model` OPTIONS
(model_type='LOGISTIC_REG',
auto_class_weights=TRUE,
input_label_cols=['cardio']) AS
SELECT age, gender, height, weight, ap_hi,
ap_lo, cholesterol, gluc, smoke,
alco, active, cardio
FROM `project-dynamic-modeling.cardio_disease.cardio_disease`
There is also a managed service in Google Cloud Platform (GCP) -- called AutoML Tables -- which provides a totally seamless experience for citizen data scientists.
Today, we focus on the middle ground: building a classification model from scratch. Specifically, we will use Google Colab (a freemium Jupyter notebook service for Julia, Python, and R) to ingest, shape, explore, visualize, and model data.
There is also a managed JupyterHub environment offered by Google (called AI Notebooks) that we will utilize later.
Garfield TVitcharoo Trivia
Garfield lies down in the evening to watch TV. I think his biometrics and activity choices during the day are indicative of his TV propensity at night. Do you see a pattern in this data?
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It was developed by Wes McKinney in 2008.
import pandas as pd

# Fetch data from URL
# Of course, pandas can fetch data from many other sources like SQL databases, files, cloud storage, etc.
garfield_biometrics = pd.read_csv('https://drive.google.com/uc?export=download&id=1_pOxAYnUWZ0FNdVnPMdkBo0WEgfMZLy0').\
    applymap(lambda x: x if x is not None and str(x).lower() != 'nan' else None)
display(garfield_biometrics)
Pandas can, of course, slice, dice, and describe data. Traditional sorting, filtering, grouping, and transforms work too.
# Columnar definition of the data
from IPython.display import *
display(HTML("The columns of the dataframe are"))
display(pd.DataFrame(garfield_biometrics.columns, columns=['Column Name']).set_index('Column Name').T)
display(HTML("The rows of the dataframe are"))
display(pd.DataFrame(garfield_biometrics.index, columns=['Row Name']).set_index('Row Name').T)
The columns of the dataframe are: Day, 8AM, 9AM, 10AM, 11AM, Noon, Lunch Bill, 1PM, 2PM, 3PM, 4PM, 5PM, Commute, DayOfWeek, WatchTV
The rows of the dataframe are indexed 0 through 20.
# Shape of the data
HTML(f'The shape of the dataset is {garfield_biometrics.shape[0]} rows and {garfield_biometrics.shape[1]} columns')
The shape of the dataset is 21 rows and 15 columns
# Slice of the data, row-wise alternates
display(garfield_biometrics[4:12:2].head(100))
| | Day | 8AM | 9AM | 10AM | 11AM | Noon | Lunch Bill | 1PM | 2PM | 3PM | 4PM | 5PM | Commute | DayOfWeek | WatchTV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 5-Jan-21 | Doughnut | 3 | 10 | 3 | Sandwich | 7.35 | 0 | 7 | 6 | Tea | 7 | Long | Fri | Yes |
| 6 | 7-Jan-21 | Doughnut | 3 | 10 | 7 | Lenthils | 2.80 | 10 | 6 | 1 | Coffee | 6 | Short | Mon | No |
| 8 | 9-Jan-21 | Sandwich | 5 | 4 | 7 | Lenthils | 2.98 | 2 | 1 | 3 | PingPong | 5 | Short | Wed | No |
| 10 | 11-Jan-21 | Doughnut | 7 | 9 | 8 | Sandwich | 7.35 | 1 | 4 | 4 | Workout | 3 | Long | Fri | Yes |
# Slice of the data, column-wise, first five columns
display(garfield_biometrics.iloc[:, :5].head(5))
| | Day | 8AM | 9AM | 10AM | 11AM |
|---|---|---|---|---|---|
| 0 | 1-Jan-21 | Coffee | 6 | 6 | 0 |
| 1 | 2-Jan-21 | Doughnut | 2 | 5 | 5 |
| 2 | 3-Jan-21 | Coffee | 7 | 10 | 9 |
| 3 | 4-Jan-21 | Coffee | 9 | 7 | 8 |
| 4 | 5-Jan-21 | Doughnut | 3 | 10 | 3 |
# Sorted by breakfast at 8AM
display(garfield_biometrics.sort_values('8AM').head(5))
| | Day | 8AM | 9AM | 10AM | 11AM | Noon | Lunch Bill | 1PM | 2PM | 3PM | 4PM | 5PM | Commute | DayOfWeek | WatchTV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1-Jan-21 | Coffee | 6 | 6 | 0 | Sandwich | 7.35 | 9 | 8 | 5 | Tea | 2 | Long | Mon | Yes |
| 18 | 19-Jan-21 | Coffee | 5 | 7 | 6 | Taco | 4.75 | 9 | 6 | 10 | Workout | 5 | Short | Mon | No |
| 15 | 16-Jan-21 | Coffee | 6 | 0 | 1 | Lenthils | 3.20 | 4 | 10 | 3 | PingPong | 6 | Short | Thu | No |
| 14 | 15-Jan-21 | Coffee | 5 | 9 | 5 | Taco | 4.60 | 8 | 0 | 3 | Coffee | 10 | Short | Wed | Yes |
| 19 | 20-Jan-21 | Coffee | 6 | 0 | 2 | Sandwich | 7.35 | 6 | 7 | 4 | Workout | 6 | Short | Tue | None |
# Specific columns
display(garfield_biometrics[['8AM', 'Noon', 'Commute', 'WatchTV']].head(5))
| | 8AM | Noon | Commute | WatchTV |
|---|---|---|---|---|
| 0 | Coffee | Sandwich | Long | Yes |
| 1 | Doughnut | Lenthils | Short | No |
| 2 | Coffee | Taco | Short | No |
| 3 | Coffee | Sandwich | Short | Yes |
| 4 | Doughnut | Sandwich | Long | Yes |
# Group count lunches
garfield_biometrics.groupby('Noon')['Noon'].agg('count').to_frame()
| Noon | Noon |
|---|---|
| Lenthils | 6 |
| Sandwich | 8 |
| Taco | 7 |
# Filter: what is the WatchTV value when lunch is Taco?
garfield_biometrics[garfield_biometrics.Noon == 'Taco'][['Noon', 'WatchTV']]
| | Noon | WatchTV |
|---|---|---|
| 2 | Taco | No |
| 7 | Taco | No |
| 9 | Taco | Yes |
| 12 | Taco | No |
| 14 | Taco | Yes |
| 17 | Taco | Yes |
| 18 | Taco | No |
# Data types of the fields
display(garfield_biometrics.dtypes)
display(garfield_biometrics.astype(str).dtypes)

# String type selection only
garfield_biometrics.select_dtypes('object').head()
| | Day | 8AM | Noon | 4PM | Commute | DayOfWeek | WatchTV |
|---|---|---|---|---|---|---|---|
| 0 | 1-Jan-21 | Coffee | Sandwich | Tea | Long | Mon | Yes |
| 1 | 2-Jan-21 | Doughnut | Lenthils | PingPong | Short | Tue | No |
| 2 | 3-Jan-21 | Coffee | Taco | PingPong | Short | Wed | No |
| 3 | 4-Jan-21 | Coffee | Sandwich | PingPong | Short | Thu | Yes |
| 4 | 5-Jan-21 | Doughnut | Sandwich | Tea | Long | Fri | Yes |
# Capture categorical variables
string_types = garfield_biometrics.select_dtypes('object').columns.tolist()

# Make an in-mem copy of the table
garfield_biometrics_copy = garfield_biometrics.copy()

# For each column that is a string, transform to title case and remove any trailing space
garfield_biometrics_copy[string_types] = garfield_biometrics_copy[string_types].applymap(
    lambda x: str(x).strip().title() if x is not None and str(x).lower() != 'none' else None)

# Preview
display(garfield_biometrics_copy.head(200))
| | Day | 8AM | 9AM | 10AM | 11AM | Noon | Lunch Bill | 1PM | 2PM | 3PM | 4PM | 5PM | Commute | DayOfWeek | WatchTV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1-Jan-21 | Coffee | 6 | 6 | 0 | Sandwich | 7.35 | 9 | 8 | 5 | Tea | 2 | Long | Mon | Yes |
| 1 | 2-Jan-21 | Doughnut | 2 | 5 | 5 | Lenthils | 3.02 | 3 | 4 | 3 | Pingpong | 0 | Short | Tue | No |
| 2 | 3-Jan-21 | Coffee | 7 | 10 | 9 | Taco | 4.50 | 0 | 4 | 3 | Pingpong | 7 | Short | Wed | No |
| 3 | 4-Jan-21 | Coffee | 9 | 7 | 8 | Sandwich | 7.35 | 2 | 6 | 2 | Pingpong | 5 | Short | Thu | Yes |
| 4 | 5-Jan-21 | Doughnut | 3 | 10 | 3 | Sandwich | 7.35 | 0 | 7 | 6 | Tea | 7 | Long | Fri | Yes |
| 5 | 6-Jan-21 | Sandwich | 9 | 9 | 1 | Lenthils | 2.98 | 5 | 7 | 10 | Coffee | 10 | Short | Sat | No |
| 6 | 7-Jan-21 | Doughnut | 3 | 10 | 7 | Lenthils | 2.80 | 10 | 6 | 1 | Coffee | 6 | Short | Mon | No |
| 7 | 8-Jan-21 | Coffee | 3 | 0 | 6 | Taco | 4.40 | 8 | 5 | 3 | Tea | 6 | Short | Tue | No |
| 8 | 9-Jan-21 | Sandwich | 5 | 4 | 7 | Lenthils | 2.98 | 2 | 1 | 3 | Pingpong | 5 | Short | Wed | No |
| 9 | 10-Jan-21 | Coffee | 6 | 10 | 1 | Taco | 5.00 | 4 | 3 | 5 | Workout | 0 | Short | Thu | Yes |
| 10 | 11-Jan-21 | Doughnut | 7 | 9 | 8 | Sandwich | 7.35 | 1 | 4 | 4 | Workout | 3 | Long | Fri | Yes |
| 11 | 12-Jan-21 | Sandwich | 9 | 6 | 7 | Sandwich | 7.39 | 10 | 7 | 3 | Workout | 5 | Long | Sat | Yes |
| 12 | 13-Jan-21 | Sandwich | 8 | 10 | 7 | Taco | 4.50 | 9 | 0 | 3 | Pingpong | 1 | Short | Mon | No |
| 13 | 14-Jan-21 | Doughnut | 2 | 2 | 2 | Sandwich | 7.25 | 9 | 4 | 4 | Tea | 9 | Short | Tue | Yes |
| 14 | 15-Jan-21 | Coffee | 5 | 9 | 5 | Taco | 4.60 | 8 | 0 | 3 | Coffee | 10 | Short | Wed | Yes |
| 15 | 16-Jan-21 | Coffee | 6 | 0 | 1 | Lenthils | 3.20 | 4 | 10 | 3 | Pingpong | 6 | Short | Thu | No |
| 16 | 17-Jan-21 | Sandwich | 0 | 9 | 5 | Sandwich | 7.45 | 0 | 6 | 3 | Pingpong | 3 | Short | Fri | Yes |
| 17 | 18-Jan-21 | Doughnut | 2 | 0 | 4 | Taco | 4.80 | 8 | 5 | 5 | Coffee | 2 | Long | Sat | Yes |
| 18 | 19-Jan-21 | Coffee | 5 | 7 | 6 | Taco | 4.75 | 9 | 6 | 10 | Workout | 5 | Short | Mon | No |
| 19 | 20-Jan-21 | Coffee | 6 | 0 | 2 | Sandwich | 7.35 | 6 | 7 | 4 | Workout | 6 | Short | Tue | None |
| 20 | 21-Jan-21 | Coffee | 9 | 9 | 3 | Lenthils | 2.79 | 6 | 9 | 4 | Pingpong | 9 | Long | Wed | None |
# Rename '8AM' to 'Breakfast', 'Noon' to 'Lunch', '4PM' to 'Post Siesta'
display(garfield_biometrics_copy.rename({
    '8AM': 'Breakfast',
    'Noon': 'Lunch',
    '4PM': 'Post Siesta'
}, axis=1).head(5))
# Notice rename returned a copy: the original dataframe's columns are unchanged
display(pd.DataFrame(garfield_biometrics_copy.columns, columns=['Column Name']).set_index('Column Name').T)
Let us describe (skew, mean, mode, min, max) the distributions of the data.
# Descriptive stats of the numerical attributes
display(garfield_biometrics_copy.describe())
| | 9AM | 10AM | 11AM | Lunch Bill | 1PM | 2PM | 3PM | 5PM |
|---|---|---|---|---|---|---|---|---|
| count | 21.000000 | 21.000000 | 21.000000 | 21.000000 | 21.000000 | 21.000000 | 21.000000 | 21.000000 |
| mean | 5.333333 | 6.285714 | 4.619048 | 5.198095 | 5.380952 | 5.190476 | 4.142857 | 5.095238 |
| std | 2.708013 | 3.809762 | 2.729033 | 1.867262 | 3.570381 | 2.676174 | 2.242448 | 3.048028 |
| min | 0.000000 | 0.000000 | 0.000000 | 2.790000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 25% | 3.000000 | 4.000000 | 2.000000 | 3.200000 | 2.000000 | 4.000000 | 3.000000 | 3.000000 |
| 50% | 6.000000 | 7.000000 | 5.000000 | 4.750000 | 6.000000 | 6.000000 | 3.000000 | 5.000000 |
| 75% | 7.000000 | 9.000000 | 7.000000 | 7.350000 | 9.000000 | 7.000000 | 5.000000 | 7.000000 |
| max | 9.000000 | 10.000000 | 9.000000 | 7.450000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 |
Visualizing the data distributions
%matplotlib inline
import seaborn as sns
import numpy as np

# Set figure size
sns.set(rc={'figure.figsize': (9.0, 5.0)}, style="darkgrid")

# Show distribution plots
sns.kdeplot(data=garfield_biometrics_copy.select_dtypes(include=np.number))
(Output: a KDE density plot of the numeric columns, y-axis labeled 'Density'.)
# Describe categorical types of data as well
garfield_biometrics_copy.select_dtypes(exclude=np.number).describe(include='all')
# garfield_biometrics_copy.Lunch.value_counts().plot(kind='bar')
| | Day | Breakfast | Lunch | Post Siesta | Commute | DayOfWeek | WatchTV |
|---|---|---|---|---|---|---|---|
| count | 21 | 21 | 21 | 21 | 21 | 21 | 19 |
| unique | 21 | 3 | 3 | 4 | 2 | 6 | 2 |
| top | 6-Jan-21 | Coffee | Sandwich | Pingpong | Short | Mon | Yes |
| freq | 1 | 10 | 8 | 8 | 15 | 4 | 10 |
Training, Testing, Scoring Datasets
(Illustration: a mix of labeled examples -- Dog, Cat -- alongside Unknown, unlabeled items.)
Training Dataset: The sample of labeled data used to fit the model.
The actual dataset that we use to train the model. The model sees and learns from this data.
Testing Dataset: The sample of labeled data used to provide an unbiased evaluation of a final model fit on the training dataset. A test dataset is independent of the training dataset, but it follows the same probability distribution as the training dataset.
If a model learned from the training dataset also fits the test dataset well, the model is NOT overfit.
If a model learned from the training dataset does not predict test dataset well, the model is overfit.
Scoring Dataset: The unlabeled data -- from real world -- that is used to predict outcomes from the trained model.
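The training/testing split above can be sketched with scikit-learn's train_test_split; the arrays here are hypothetical toy data, not the Garfield frame:

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Toy labeled data: 10 examples, 3 features each (hypothetical values)
X = np.arange(30).reshape(10, 3)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 30% of the labeled data as the testing dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)  # (7, 3) (3, 3)
```

The scoring dataset needs no split: it is unlabeled, so it is never used to fit or evaluate, only to predict.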
Labeled Data
Only consider the "labeled" data: data that has been "supervised" by human intelligence. Notice that our toy data has missing labels in the last two rows. We will use this data as "scoring" data.
Numerical Attributes -- Independent variables in the study usually represented as a real number.
Temporal Attributes -- Time variable: for example, date fields. Span/aging factors can be derived.
Spatial Attributes -- Location variable: for example, latitude and longitude. Distance factors can be derived.
Ordinal Attributes -- Numerical or Text variables: implies ordering. For example, low, medium, high can be encoded as 1, 2, 3 respectively
Categorical Attributes -- String variables: usually do not imply any ordinality (ordering) but have small cardinality. For example, Male-Female, Winter-Spring-Summer-Fall
Text Attributes -- String variables that usually have very high cardinality. For example, user reviews with commentary
ID Attributes -- Identity attributes (usually string/long numbers) that have no significance in predicting outcome. For example, social security number, warehouse id. It is best to avoid these ID attributes in the modeling exercise.
Leakage attributes -- redundant attributes that are deterministically correlated with the outcome label. For example, suppose we have two temperature attributes -- one in Fahrenheit and one in Celsius -- and Fahrenheit is the predicted attribute; accidentally including the Celsius attribute in the model will lead to trivially perfect predictions that fail to capture the true stochasticity of the problem.
Labels
Categorical Labels -- Usually a string or ordinal variable with small cardinality. For example, asymptomatic recovery, symptomatic recovery, intensive care recovery, fatal. This usually indicates a classification problem.
Numerical Labels -- Usually a numerical output variable. For example, business travel volume. This usually indicates a regression problem.
When labels do not exist in the dataset, it usually indicates an unsupervised learning problem.
Data Imputations
Impute missing values with the mean, interpolation, forward-fill, or backward-fill -- or drop them altogether.
# Where do we have invalid values?
garfield_biometrics_copy.isna()[-4:]
(Output: a 4 × 15 boolean frame for rows 17-20 -- every cell is False except WatchTV in rows 19 and 20, which are True: the two missing labels.)
# Impute with forward fill
display(garfield_biometrics_copy.fillna(method='ffill')[-3:])
| | Day | Breakfast | 9AM | 10AM | 11AM | Lunch | Lunch Bill | 1PM | 2PM | 3PM | Post Siesta | 5PM | Commute | DayOfWeek | WatchTV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | 19-Jan-21 | Coffee | 5 | 7 | 6 | Taco | 4.75 | 9 | 6 | 10 | Workout | 5 | Short | Mon | No |
| 19 | 20-Jan-21 | Coffee | 6 | 0 | 2 | Sandwich | 7.35 | 6 | 7 | 4 | Workout | 6 | Short | Tue | No |
| 20 | 21-Jan-21 | Coffee | 9 | 9 | 3 | Lenthils | 2.79 | 6 | 9 | 4 | Pingpong | 9 | Long | Wed | No |
# Impute with backfill
display(garfield_biometrics_copy.fillna(method='bfill')[-3:])
| | Day | Breakfast | 9AM | 10AM | 11AM | Lunch | Lunch Bill | 1PM | 2PM | 3PM | Post Siesta | 5PM | Commute | DayOfWeek | WatchTV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | 19-Jan-21 | Coffee | 5 | 7 | 6 | Taco | 4.75 | 9 | 6 | 10 | Workout | 5 | Short | Mon | No |
| 19 | 20-Jan-21 | Coffee | 6 | 0 | 2 | Sandwich | 7.35 | 6 | 7 | 4 | Workout | 6 | Short | Tue | None |
| 20 | 21-Jan-21 | Coffee | 9 | 9 | 3 | Lenthils | 2.79 | 6 | 9 | 4 | Pingpong | 9 | Long | Wed | None |
# Impute with mode
display(garfield_biometrics_copy.fillna(garfield_biometrics_copy.WatchTV.mode()[0])[-3:])
| | Day | Breakfast | 9AM | 10AM | 11AM | Lunch | Lunch Bill | 1PM | 2PM | 3PM | Post Siesta | 5PM | Commute | DayOfWeek | WatchTV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | 19-Jan-21 | Coffee | 5 | 7 | 6 | Taco | 4.75 | 9 | 6 | 10 | Workout | 5 | Short | Mon | No |
| 19 | 20-Jan-21 | Coffee | 6 | 0 | 2 | Sandwich | 7.35 | 6 | 7 | 4 | Workout | 6 | Short | Tue | Yes |
| 20 | 21-Jan-21 | Coffee | 9 | 9 | 3 | Lenthils | 2.79 | 6 | 9 | 4 | Pingpong | 9 | Long | Wed | Yes |
# Impute with the most frequent value (equivalent to the mode)
display(garfield_biometrics_copy.fillna(garfield_biometrics_copy.WatchTV.value_counts().idxmax())[-3:])
| | Day | Breakfast | 9AM | 10AM | 11AM | Lunch | Lunch Bill | 1PM | 2PM | 3PM | Post Siesta | 5PM | Commute | DayOfWeek | WatchTV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | 19-Jan-21 | Coffee | 5 | 7 | 6 | Taco | 4.75 | 9 | 6 | 10 | Workout | 5 | Short | Mon | No |
| 19 | 20-Jan-21 | Coffee | 6 | 0 | 2 | Sandwich | 7.35 | 6 | 7 | 4 | Workout | 6 | Short | Tue | Yes |
| 20 | 21-Jan-21 | Coffee | 9 | 9 | 3 | Lenthils | 2.79 | 6 | 9 | 4 | Pingpong | 9 | Long | Wed | Yes |
Data Shaping
Pivot, transpose, or interpolate data
# Original data preview
display(garfield_biometrics_copy.tail(5))

# Set daily "index"
display(garfield_biometrics_copy.set_index('Day').head(5))
| Day | Breakfast | 9AM | 10AM | 11AM | Lunch | Lunch Bill | 1PM | 2PM | 3PM | Post Siesta | 5PM | Commute | DayOfWeek | WatchTV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1-Jan-21 | Coffee | 6 | 6 | 0 | Sandwich | 7.35 | 9 | 8 | 5 | Tea | 2 | Long | Mon | Yes |
| 2-Jan-21 | Doughnut | 2 | 5 | 5 | Lenthils | 3.02 | 3 | 4 | 3 | Pingpong | 0 | Short | Tue | No |
| 3-Jan-21 | Coffee | 7 | 10 | 9 | Taco | 4.50 | 0 | 4 | 3 | Pingpong | 7 | Short | Wed | No |
| 4-Jan-21 | Coffee | 9 | 7 | 8 | Sandwich | 7.35 | 2 | 6 | 2 | Pingpong | 5 | Short | Thu | Yes |
| 5-Jan-21 | Doughnut | 3 | 10 | 3 | Sandwich | 7.35 | 0 | 7 | 6 | Tea | 7 | Long | Fri | Yes |
# Reindex into a proper datetime format
display(garfield_biometrics_copy.set_index(pd.to_datetime(garfield_biometrics_copy.Day)).drop('Day', axis=1).head(5))
| Day | Breakfast | 9AM | 10AM | 11AM | Lunch | Lunch Bill | 1PM | 2PM | 3PM | Post Siesta | 5PM | Commute | DayOfWeek | WatchTV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2021-01-01 | Coffee | 6 | 6 | 0 | Sandwich | 7.35 | 9 | 8 | 5 | Tea | 2 | Long | Mon | Yes |
| 2021-01-02 | Doughnut | 2 | 5 | 5 | Lenthils | 3.02 | 3 | 4 | 3 | Pingpong | 0 | Short | Tue | No |
| 2021-01-03 | Coffee | 7 | 10 | 9 | Taco | 4.50 | 0 | 4 | 3 | Pingpong | 7 | Short | Wed | No |
| 2021-01-04 | Coffee | 9 | 7 | 8 | Sandwich | 7.35 | 2 | 6 | 2 | Pingpong | 5 | Short | Thu | Yes |
| 2021-01-05 | Doughnut | 3 | 10 | 3 | Sandwich | 7.35 | 0 | 7 | 6 | Tea | 7 | Long | Fri | Yes |
# Reindex into two-day intervals
garfield_biometrics_copy.set_index(pd.to_datetime(garfield_biometrics_copy.Day)).drop('Day', axis=1).resample('2d').agg(lambda x: x.value_counts().idxmax()).head(5)
Exclude ID attributes and leakage attributes; include only numerical, temporal, spatial, ordinal, and categorical attributes. Also encode labels accordingly.
# Leave out the date attribute
garfield_data = garfield_biometrics_copy.set_index(pd.to_datetime(garfield_biometrics_copy.Day)).drop('Day', axis=1)

# Fix the DayOfWeek too
garfield_data['DayOfWeek'] = list(map(lambda x: x.strftime('%A'), garfield_data.index))

# Show preview
display(garfield_data.head(4))
| Day | Breakfast | 9AM | 10AM | 11AM | Lunch | Lunch Bill | 1PM | 2PM | 3PM | Post Siesta | 5PM | Commute | DayOfWeek | WatchTV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2021-01-01 | Coffee | 6 | 6 | 0 | Sandwich | 7.35 | 9 | 8 | 5 | Tea | 2 | Long | Friday | Yes |
| 2021-01-02 | Doughnut | 2 | 5 | 5 | Lenthils | 3.02 | 3 | 4 | 3 | Pingpong | 0 | Short | Saturday | No |
| 2021-01-03 | Coffee | 7 | 10 | 9 | Taco | 4.50 | 0 | 4 | 3 | Pingpong | 7 | Short | Sunday | No |
| 2021-01-04 | Coffee | 9 | 7 | 8 | Sandwich | 7.35 | 2 | 6 | 2 | Pingpong | 5 | Short | Monday | Yes |
# Most handy function
garfield_numerical_data = pd.get_dummies(garfield_data)
display(garfield_numerical_data.head(5))
| Day | 9AM | 10AM | 11AM | Lunch Bill | 1PM | 2PM | 3PM | 5PM | Breakfast_Coffee | Breakfast_Doughnut | ... | Commute_Short | DayOfWeek_Friday | DayOfWeek_Monday | DayOfWeek_Saturday | DayOfWeek_Sunday | DayOfWeek_Thursday | DayOfWeek_Tuesday | DayOfWeek_Wednesday | WatchTV_No | WatchTV_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2021-01-01 | 6 | 6 | 0 | 7.35 | 9 | 8 | 5 | 2 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2021-01-02 | 2 | 5 | 5 | 3.02 | 3 | 4 | 3 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2021-01-03 | 7 | 10 | 9 | 4.50 | 0 | 4 | 3 | 7 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2021-01-04 | 9 | 7 | 8 | 7.35 | 2 | 6 | 2 | 5 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2021-01-05 | 3 | 10 | 3 | 7.35 | 0 | 7 | 6 | 7 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |

5 rows × 29 columns
Leading Indicators
Quickly compute correlation coefficients to determine whether moving one attribute has any bearing on the output.
(Illustration: correlated factors vs. uncorrelated factors.)
(labeled_data, scoring_data) = (garfield_numerical_data[:19], garfield_numerical_data[19:])
# Develop a handy function to select input attributes
input_data = lambda df: df[[col for col in df.columns if 'WatchTV' not in col]]
label_data = lambda df: df.WatchTV_Yes

# Display preview
display(input_data(labeled_data).head(3))
| Day | 9AM | 10AM | 11AM | Lunch Bill | 1PM | 2PM | 3PM | 5PM | Breakfast_Coffee | Breakfast_Doughnut | ... | Post Siesta_Workout | Commute_Long | Commute_Short | DayOfWeek_Friday | DayOfWeek_Monday | DayOfWeek_Saturday | DayOfWeek_Sunday | DayOfWeek_Thursday | DayOfWeek_Tuesday | DayOfWeek_Wednesday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2021-01-01 | 6 | 6 | 0 | 7.35 | 9 | 8 | 5 | 2 | 1 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2021-01-02 | 2 | 5 | 5 | 3.02 | 3 | 4 | 3 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2021-01-03 | 7 | 10 | 9 | 4.50 | 0 | 4 | 3 | 7 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |

3 rows × 27 columns
Simple Pearson Correlation
Compute correlation vector between label and input attributes (direction & magnitude of change)
corr_scores = input_data(labeled_data).corrwith(label_data(labeled_data)).to_frame('Correlation')

# Project an absolute score -- positive & negative are both indicative
corr_scores['Abs Correlation'] = corr_scores['Correlation'].apply(abs)
corr_scores = corr_scores.sort_values('Abs Correlation', ascending=False)
display(corr_scores.head(100))
Initialized empty Git repository in /home/jovyan/CloudSDK/covid_data/.git/
Branch 'master' set up to track remote branch 'master' from 'origin'.
From https://github.com/CSSEGISandData/COVID-19
* branch master -> FETCH_HEAD
* [new branch] master -> origin/master
Already on 'master'
Collate
Collate data together temporally and compute active, tested, confirmed, recovered cases in the US.
Ensure we roll up the data weekly and align it to Monday.
weekly_covid_data = covid_us_data.groupby(['Province_State', 'Lat', 'Long_']).resample('W-MON').agg(
    {'Confirmed': sum, 'Deaths': sum, 'Recovered': sum, 'Active': sum})
display(weekly_covid_data)
# Flatten to make it weekly data; ready for plotting
plot_data = weekly_covid_data.reset_index()
display(plot_data)
Model the Garfield TVitcharoo bot using simple logistic regression. Logistic regression is similar to linear regression, but instead of predicting a continuous output, it classifies training examples into a set of categories or labels. For example, linear regression on a set of electoral surveys might predict a candidate's electoral-vote count, while logistic regression could predict the president-elect. Logistic regression predicts classes, not numeric magnitudes. It also extends easily to multiclass problems, where there are more than two label categories.
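The mechanism behind this is the logistic (sigmoid) function, which squashes a linear score into a probability in (0, 1); a minimal sketch with hypothetical scores:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued linear score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A linear score w.x + b (hypothetical values) becomes a class probability
scores = np.array([-3.0, 0.0, 3.0])
probs = sigmoid(scores)
print(probs.round(3))  # [0.047 0.5   0.953]

# Predict class 1 when the probability crosses the 0.5 threshold
print((probs >= 0.5).astype(int))  # [0 1 1]
```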
# Training and scoring data
(labeled_data, scoring_data) = (garfield_numerical_data[:19], garfield_numerical_data[19:])
# Collect input
input_data = lambda df: df[[col for col in df.columns if 'WatchTV' not in col]]

# Collect output
label_data = lambda df: df.WatchTV_Yes

X = input_data(labeled_data)
y = label_data(labeled_data)

# Display so we can see label data and input data together
display(pd.concat([X.head(3), y.head(3).to_frame('WatchTV_Yes')], axis=1))

# Build a model
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(X, y)
clf
| Day | 9AM | 10AM | 11AM | Lunch Bill | 1PM | 2PM | 3PM | 5PM | Breakfast_Coffee | Breakfast_Doughnut | ... | Commute_Long | Commute_Short | DayOfWeek_Friday | DayOfWeek_Monday | DayOfWeek_Saturday | DayOfWeek_Sunday | DayOfWeek_Thursday | DayOfWeek_Tuesday | DayOfWeek_Wednesday | WatchTV_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2021-01-01 | 6 | 6 | 0 | 7.35 | 9 | 8 | 5 | 2 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2021-01-02 | 2 | 5 | 5 | 3.02 | 3 | 4 | 3 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2021-01-03 | 7 | 10 | 9 | 4.50 | 0 | 4 | 3 | 7 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |

3 rows × 28 columns
LogisticRegression()
Use Model to Predict
# Use model to predict scoring data
display(pd.DataFrame(clf.predict(input_data(scoring_data)), columns=['WatchTV']))

# Also display probabilistic scores
display(pd.DataFrame(clf.predict_proba(input_data(scoring_data)), columns=['WatchTV_Yes_0', 'WatchTV_Yes_1']))
| | WatchTV |
|---|---|
| 0 | 1 |
| 1 | 0 |

| | WatchTV_Yes_0 | WatchTV_Yes_1 |
|---|---|---|
| 0 | 0.018344 | 0.981656 |
| 1 | 0.948408 | 0.051592 |
Explaining the Model
Can we explain the model? Logistic regression is just a linear regression whose continuous output is passed through a logistic curve and thresholded into a category.
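Because it is linear underneath, the learned weights themselves explain the model: each coefficient is the change in the log-odds of the positive class per unit of that feature. A self-contained sketch with a tiny hypothetical stand-in for the Garfield frame (not the real data):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical stand-in for the Garfield features
X = pd.DataFrame({'Commute_Long': [1, 0, 1, 0, 1, 0],
                  'Lunch Bill':   [7.35, 2.80, 7.35, 4.50, 7.39, 3.20]})
y = [1, 0, 1, 0, 1, 0]  # WatchTV_Yes

clf = LogisticRegression().fit(X, y)

# Each coefficient is the change in log-odds of WatchTV_Yes
# per unit increase of that feature, holding the others fixed
weights = pd.Series(clf.coef_[0], index=X.columns, name='Weight')
print(weights.sort_values(key=abs, ascending=False))
```

In this toy data both features move with the label, so both weights come out positive; in the real notebook the same pattern applied to `clf` and `X` from the cells above would rank the leading indicators.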
Naive Bayes classifiers are built on Bayesian classification methods.
These rely on Bayes's theorem, which is an equation describing the relationship of conditional probabilities of statistical quantities.
In Bayesian classification, we're interested in finding the probability of a label given some observed features, which we can write as $P(L~|~{\rm features})$.
Bayes's theorem tells us how to express this in terms of quantities we can compute more directly:
If we are trying to decide between two labels—let's call them $L_1$ and $L_2$—then one way to make this decision is to compute the ratio of the posterior probabilities for each label:
All we need now is some model by which we can compute $P({\rm features}~|~L_i)$ for each label.
Such a model is called a generative model because it specifies the hypothetical random process that generates the data.
Specifying this generative model for each label is the main piece of the training of such a Bayesian classifier.
The general version of such a training step is a very difficult task, but we can make it simpler through the use of some simplifying assumptions about the form of this model.
This is where the "naive" in "naive Bayes" comes in: if we make very naive assumptions about the generative model for each label, we can find a rough approximation of the generative model for each class, and then proceed with the Bayesian classification.
Different types of naive Bayes classifiers rest on different naive assumptions about the data, and we will examine a few of these in the following sections.
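The notebook's Gaussian naive Bayes fitting cell appears to have been lost in the export; a minimal self-contained sketch of that step, using toy stand-in data rather than the Garfield frame:

```python
from sklearn.naive_bayes import GaussianNB
import numpy as np

# GaussianNB assumes P(feature | label) is a per-class normal distribution
# Toy stand-in data: two well-separated classes (hypothetical values)
X = np.array([[1.0, 5.0], [1.2, 4.8], [3.0, 1.0], [3.2, 0.8]])
y = np.array([0, 0, 1, 1])

nb = GaussianNB().fit(X, y)
print(nb.predict([[1.1, 5.1], [3.1, 0.9]]))  # [0 1]
```

In the notebook the same pattern -- `GaussianNB().fit(X, y)` on the one-hot frame -- is what produces the disappointing result discussed next.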
What went wrong? Remember that Bayesian models make assumptions about prior distributions. In our case, we assumed the data followed a Gaussian distribution, but the one-hot encoding produces 0/1 (binary) features. The NB classifier is a parametric model.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Normalize into a Gaussian bell curve
scaler.fit(input_data(labeled_data))

# Pretty print
Xstd = pd.DataFrame(scaler.transform(input_data(labeled_data)), columns=input_data(labeled_data).columns)
display(Xstd.head())
| | 9AM | 10AM | 11AM | Lunch Bill | 1PM | 2PM | 3PM | 5PM | Breakfast_Coffee | Breakfast_Doughnut | ... | Post Siesta_Workout | Commute_Long | Commute_Short | DayOfWeek_Friday | DayOfWeek_Monday | DayOfWeek_Saturday | DayOfWeek_Sunday | DayOfWeek_Thursday | DayOfWeek_Tuesday | DayOfWeek_Wednesday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.339728 | -0.132525 | -1.793267 | 1.210436 | 1.007429 | 1.216559 | 0.366103 | -0.954296 | 1.172604 | -0.679366 | ... | -0.516398 | 1.673320 | -1.673320 | 2.309401 | -0.433013 | -0.433013 | -0.433013 | -0.342997 | -0.433013 | -0.342997 |
| 1 | -1.179057 | -0.412300 | 0.058476 | -1.240525 | -0.633241 | -0.350534 | -0.503392 | -1.625838 | -0.852803 | 1.471960 | ... | -0.516398 | -0.597614 | 0.597614 | -0.433013 | -0.433013 | 2.309401 | -0.433013 | -0.342997 | -0.433013 | -0.342997 |
| 2 | 0.719425 | 0.986575 | 1.539870 | -0.402783 | -1.453576 | -0.350534 | -0.503392 | 0.724558 | 1.172604 | -0.679366 | ... | -0.516398 | -0.597614 | 0.597614 | -0.433013 | -0.433013 | -0.433013 | 2.309401 | -0.342997 | -0.433013 | -0.342997 |
| 3 | 1.478817 | 0.147250 | 1.169522 | 1.210436 | -0.906686 | 0.433013 | -0.938139 | 0.053016 | 1.172604 | -0.679366 | ... | -0.516398 | -0.597614 | 0.597614 | -0.433013 | 2.309401 | -0.433013 | -0.433013 | -0.342997 | -0.433013 | -0.342997 |
| 4 | -0.799361 | 0.986575 | -0.682221 | 1.210436 | -1.453576 | 0.824786 | 0.800850 | 0.724558 | -0.852803 | 1.471960 | ... | -0.516398 | 1.673320 | -1.673320 | -0.433013 | -0.433013 | -0.433013 | -0.433013 | -0.342997 | 2.309401 | -0.342997 |

5 rows × 27 columns
# Also display probabilistic scores
from sklearn.naive_bayes import GaussianNB
display(pd.DataFrame(GaussianNB().fit(Xstd, y).
                     predict_proba(scaler.transform(input_data(scoring_data))),
                     columns=['WatchTV_Yes_0', 'WatchTV_Yes_1']))
The principle behind nearest neighbor methods is to find a predefined number of training samples (K) closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning).
from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Use model to predict scoring data
display(pd.DataFrame(neigh.predict(input_data(scoring_data)), columns=['WatchTV']))

# Also display probabilistic scores
display(pd.DataFrame(neigh.predict_proba(input_data(scoring_data)), columns=['WatchTV_No', 'WatchTV_Yes']))
Decision trees are extremely intuitive ways to classify or label objects: you simply ask a series of questions designed to zero in on the classification, very much like the 20 Questions game where each response can only be yes/no. Random forests are an example of an ensemble learner built on decision trees. Ensemble methods rely on aggregating the results of an ensemble of simpler estimators. The somewhat surprising result with such ensemble methods is that the sum can be greater than the parts: a majority vote among a number of estimators can end up being better than any of the individual estimators doing the voting.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier().fit(X, y)

# Use model to predict scoring data
display(pd.DataFrame(tree.predict(input_data(scoring_data)), columns=['WatchTV']))

# Also display probabilistic scores
display(pd.DataFrame(tree.predict_proba(input_data(scoring_data)), columns=['WatchTV_Yes_0', 'WatchTV_Yes_1']))
from sklearn.ensemble import BaggingClassifier

bag = BaggingClassifier(tree, n_estimators=20, max_samples=0.7, random_state=1)
bag.fit(X, y)

# Use model to predict scoring data
display(pd.DataFrame(bag.predict(input_data(scoring_data)), columns=['WatchTV']))

# Also display probabilistic scores
display(pd.DataFrame(bag.predict_proba(input_data(scoring_data)), columns=['WatchTV_Yes_0', 'WatchTV_Yes_1']))
| | WatchTV |
|---|---|
| 0 | 1 |
| 1 | 0 |

| | WatchTV_Yes_0 | WatchTV_Yes_1 |
|---|---|---|
| 0 | 0.00 | 1.00 |
| 1 | 0.95 | 0.05 |
We have randomized the data by fitting each estimator with a random subset of 70% of the training points. In practice, decision trees are more effectively randomized by injecting some stochasticity in how the splits are chosen: this way all the data contributes to the fit each time, but the results of the fit still have the desired randomness. In Scikit-Learn, an optimized ensemble of randomized decision trees is implemented in the RandomForestClassifier estimator, which takes care of all the randomization automatically.
from sklearn.ensemble import RandomForestClassifier

rfclf = RandomForestClassifier()
rfclf.fit(X, y)

# Use model to predict scoring data
display(pd.DataFrame(rfclf.predict(input_data(scoring_data)), columns=['WatchTV']))

# Also display probabilistic scores
display(pd.DataFrame(rfclf.predict_proba(input_data(scoring_data)), columns=['WatchTV_Yes_0', 'WatchTV_Yes_1']))
Bagging -- bootstrap aggregation -- trains each estimator on a random sample of the observations drawn with replacement. When we deliberately select some observations more often than others because they are difficult to classify (and reward the trees that handle them better), we are applying boosting methods. Boosting works by weighting the observations: putting more weight on instances that are difficult to classify and less on those already handled well. New weak learners are added sequentially, focusing their training on the more difficult patterns. This means that samples that are hard to classify receive increasingly larger weights until the algorithm identifies a model that classifies them correctly.
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X, y)

# Use model to predict scoring data
display(pd.DataFrame(gbc.predict(input_data(scoring_data)), columns=['WatchTV']))

# Also display probabilistic scores
display(pd.DataFrame(gbc.predict_proba(input_data(scoring_data)), columns=['WatchTV_No', 'WatchTV_Yes']))
XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. XGBoost stands for eXtreme Gradient Boosting.
from xgboost import XGBClassifier

xgb = XGBClassifier()
xgb.fit(X, y)

# Use model to predict scoring data
display(pd.DataFrame(xgb.predict(input_data(scoring_data)), columns=['WatchTV']))

# Also display probabilistic scores
display(pd.DataFrame(xgb.predict_proba(input_data(scoring_data)), columns=['WatchTV_No', 'WatchTV_Yes']))
[05:11:00] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
/opt/conda/lib/python3.8/site-packages/xgboost/sklearn.py:888: UserWarning:
The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
Consider the simple case of a classification task in which the two classes of points are well separated. While presumably any line that separates the points is decent enough, the dividing line that maximizes the margin between the two sets of points closest to the decision boundary is arguably the best. Notice that a few of the training points just touch the margin; they are indicated by the black circles in this figure. These points are the pivotal elements of the fit; they are known as the support vectors and give the algorithm its name. Support Vector Machines (SVMs) are parametric classification methods.
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

svm = make_pipeline(StandardScaler(), LinearSVC())
svm.fit(X, y)

# Use model to predict scoring data
display(pd.DataFrame(svm.predict(input_data(scoring_data)), columns=['WatchTV']))

# LinearSVC has no predict_proba; display the signed distance from the separating hyperplane instead
display(pd.DataFrame(svm.decision_function(input_data(scoring_data)), columns=['Z-Distance from Hyperplane']))
| | WatchTV |
|---|---|
| 0 | 1 |
| 1 | 0 |

| | Z-Distance from Hyperplane |
|---|---|
| 0 | 0.874192 |
| 1 | -1.225382 |
Model Selection
So which model should be selected? How do we know which is better when they all yield different results? Three points --
Non parametric methods do not assume underlying distributions, so they work better for categorical, ordinal variables.
Scikit Learn -- even for decision trees although (theoretically) supports caegorical variables -- requires that we one-hot-encode input features and output labels. Since the effort is intrinsic, it is best to let accuracy dictate model choice.
Intuition and experience -- let the palatability (passes your sniff test) and explainability (can logically articulate to others) -- guide the choice.
Precision
The ratio of correct positive predictions to the total predicted positives.
Precision = TP / (TP + FP)
Recall
The ratio of correct positive predictions to the total actual positives.
Recall = TP / (TP + FN)
Accuracy
Accuracy is defined as the ratio of correctly predicted examples to the total examples.
Accuracy = (TP + TN) / (TP + FP + FN + TN)
F1-Score
F1 Score is the harmonic mean of Precision and Recall. Therefore this score takes both false positives and false negatives into account.
F1 Score = 2 x (Precision x Recall) / (Precision + Recall)
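All four metrics are available in scikit-learn's metrics module. A quick check on a made-up pair of actual/predicted label vectors (TP=3, FP=1, FN=1, TN=3):

```python
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

# Hypothetical actual vs. predicted labels (1 = positive class)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# TP=3, FP=1, FN=1, TN=3 for this toy example
print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print(accuracy_score(y_true, y_pred))   # (3 + 3) / 8 = 0.75
print(f1_score(y_true, y_pred))         # 2 x (0.75 x 0.75) / (0.75 + 0.75) = 0.75
```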
ROC Curve
A ROC curve (receiver operating characteristic curve) graphs the performance of a classification model at all classification thresholds. In binary classification we normally choose 0.5 as the decision threshold; the ROC curve shows how many predictions flip direction as that threshold is varied.
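scikit-learn computes the curve directly from true labels and model scores. A minimal sketch on invented labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and model scores (probabilities of the positive class)
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# One (fpr, tpr) point per candidate threshold; AUC summarizes the whole curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(fpr, tpr, auc)
```

Plotting `fpr` against `tpr` (for example with `matplotlib.pyplot.plot(fpr, tpr)`) gives the familiar ROC plot.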
Cardio Model
Let us use the cardio model to verify accuracy using the testing dataset. See the definition of this dataset on Kaggle
# Convert to numerical format
numerical_cardio = pd.get_dummies(cardio_data)
display(numerical_cardio)

input_set = lambda df: df[[col for col in df.columns if col != 'cardio']]
label_set = lambda df: df.cardio
|       | age | height | weight | sistolic_bp | diastolic_bp | smoke | alcohol | active | cardio | gender_Female | gender_Male | cholesterol_Above Normal | cholesterol_Normal | cholesterol_Way Above Normal | glucose_Above Normal | glucose_Normal | glucose_Way Above Normal |
|-------|-----|--------|--------|-------------|--------------|-------|---------|--------|--------|---------------|-------------|--------------------------|--------------------|------------------------------|----------------------|----------------|--------------------------|
| 0     | 50  | 168    | 62.0   | 110         | 80           | False | False   | True   | False  | 0             | 1           | 0                        | 1                  | 0                            | 0                    | 1              | 0                        |
| 1     | 55  | 156    | 85.0   | 140         | 90           | False | False   | True   | True   | 1             | 0           | 0                        | 0                  | 1                            | 0                    | 1              | 0                        |
| 2     | 52  | 165    | 64.0   | 130         | 70           | False | False   | False  | True   | 1             | 0           | 0                        | 0                  | 1                            | 0                    | 1              | 0                        |
| 3     | 48  | 169    | 82.0   | 150         | 100          | False | False   | True   | True   | 0             | 1           | 0                        | 1                  | 0                            | 0                    | 1              | 0                        |
| 4     | 48  | 156    | 56.0   | 100         | 60           | False | False   | False  | False  | 1             | 0           | 0                        | 1                  | 0                            | 0                    | 1              | 0                        |
| ...   | ... | ...    | ...    | ...         | ...          | ...   | ...     | ...    | ...    | ...           | ...         | ...                      | ...                | ...                          | ...                  | ...            | ...                      |
| 69995 | 53  | 168    | 76.0   | 120         | 80           | True  | False   | True   | False  | 0             | 1           | 0                        | 1                  | 0                            | 0                    | 1              | 0                        |
| 69996 | 62  | 158    | 126.0  | 140         | 90           | False | False   | True   | True   | 1             | 0           | 1                        | 0                  | 0                            | 1                    | 0              | 0                        |
| 69997 | 52  | 183    | 105.0  | 180         | 90           | False | True    | False  | True   | 0             | 1           | 0                        | 0                  | 1                            | 0                    | 1              | 0                        |
| 69998 | 61  | 163    | 72.0   | 135         | 80           | False | False   | False  | True   | 1             | 0           | 0                        | 1                  | 0                            | 1                    | 0              | 0                        |
| 69999 | 56  | 170    | 72.0   | 120         | 80           | False | False   | True   | False  | 1             | 0           | 1                        | 0                  | 0                            | 0                    | 1              | 0                        |

70000 rows × 17 columns
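The fan-out of gender, cholesterol and glucose into indicator columns above is exactly what `pd.get_dummies` does: numeric columns pass through unchanged, while each categorical column becomes one 0/1 column per level. A minimal sketch on a made-up frame mirroring the cardio columns:

```python
import pandas as pd

# Made-up frame with one numeric and two categorical columns
df = pd.DataFrame({'age': [50, 55],
                   'gender': ['Male', 'Female'],
                   'cholesterol': ['Normal', 'Above Normal']})

# age passes through; gender and cholesterol each expand into
# one indicator column per observed level
encoded = pd.get_dummies(df)
print(list(encoded.columns))
```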
Withhold Test Set for Accuracy
# Split training and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(input_set(numerical_cardio),
                                                    label_set(numerical_cardio),
                                                    test_size=1.0/7)
display(X_train)
/opt/conda/lib/python3.8/site-packages/xgboost/sklearn.py:888: UserWarning:
The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
[05:11:03] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.300000012, max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=4, num_parallel_tree=1, random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
# Predict using the model on withheld test set
y_pred = gboost.predict(X_test)
display(pd.concat([pd.DataFrame(y_pred, columns=['Predicted Cardio']).reset_index(drop=True),
y_test.to_frame('Actual Withheld Cardio').reset_index(drop=True),
X_test.reset_index(drop=True)], axis=1))
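Eyeballing the predicted-vs-actual frame is a start, but the metrics from the Model Selection section summarize it in two numbers. A self-contained sketch of the evaluation step, with `make_classification` standing in for the cardio frame and `GradientBoostingClassifier` standing in for the fitted gboost model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cardio features/labels
X, y = make_classification(n_samples=700, n_features=17, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1.0/7,
                                                    random_state=0)

# Fit on the training portion only, score on the withheld portion
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Overall hit rate, plus the TP/FP/FN/TN breakdown
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```

The same two calls, applied to the real `y_test` and `gboost.predict(X_test)`, quantify the cardio model's out-of-sample accuracy.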
The EMNIST dataset is a set of handwritten character digits derived from the NIST Special Database 19 and converted to a 28x28 pixel image format and dataset structure that directly matches the MNIST dataset. Let us use this dataset to train on and then detect scribbled characters on a piece of white paper.
We achieved great accuracy on the existing MNIST sample. But if we apply the model to new handwriting -- Seshu's, in this case -- does it perform in the real world? See my scribble
from IPython.display import Image, display
import PIL
import cv2
import tensorflow as tf
from keras.preprocessing.image import img_to_array, load_img
from skimage.transform import resize as imresize

# Read input scribble
img = cv2.imread('file.png', cv2.IMREAD_GRAYSCALE)
edged = cv2.Canny(img, 10, 100)

# Detect where areas of interest exist
contours, hierarchy = cv2.findContours(edged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)

# Create a blank copy to write over
newimg = img.copy()

# Where each blotch of ink exists, clip and detect
for (x, y, w, h) in [cv2.boundingRect(ctr) for ctr in contours]:
    if w >= 10 and h >= 50:
        try:
            # Clip some border-buffer zone as well so the digit only covers 50% of the area
            digit_img = cv2.resize(edged[y-32:y+h+32, x-32:x+w+32], (28, 28), interpolation=cv2.INTER_AREA)
            # Convert clipped digit into black-and-white; MNIST standard is BW
            (_, bw_img) = cv2.threshold(digit_img, 5, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
            # Predict the scribbled letter
            digit = model.predict_classes(tf.reshape(bw_img, (1, 28, 28, 1)))[0]
            # Overlay the recognized text right on top of the existing scribble
            cv2.putText(newimg, str(digit), (max(30, x+50), max(30, y+50)), cv2.FONT_HERSHEY_SIMPLEX, 4, 0, 8)
        except Exception:
            pass

# Show original image
dimage = lambda x: PIL.Image.fromarray(x).convert("L")
display(dimage(newimg).resize((500, 300)))