<a href="https://colab.research.google.com/github/parekhakhil/pyImageSearch/blob/main/604-understanding_regularization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Understanding regularization for image classification and machine learning



This notebook is associated with the [Understanding regularization for image classification and machine learning](https://www.pyimagesearch.com/2016/09/19/understanding-regularization-for-image-classification-and-machine-learning/) blog post published on 2016-09-19.

Only the code for the blog post is here. Most code blocks have a 1:1 relationship with what you find in the blog post, with two exceptions: (1) Python classes are not in separate files, as they typically are in PyImageSearch projects, and (2) command line argument parsing is replaced with an `args` dictionary that you can manipulate as needed.

We recommend that you execute (press ▶️) the code block-by-block, as-is, before adjusting parameters and `args` inputs. Once you've verified that the code is working, you are welcome to hack with it and learn from manipulating inputs, settings, and parameters. For more information on using Jupyter and Colab, please refer to these resources:

*   [Jupyter Notebook User Interface](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html#notebook-user-interface)
*   [Overview of Google Colaboratory Features](https://colab.research.google.com/notebooks/basic_features_overview.ipynb)


Happy hacking!


<hr>



### Download the code zip file

In [None]:
!wget https://www.pyimagesearch.com/wp-content/uploads/2016/08/understanding-regularization.zip
!unzip -qq understanding-regularization.zip
%cd understanding-regularization

### Downloading the dataset

Since this dataset is hosted on Kaggle, we have two options for getting it into our Colab Notebook environment:

* Download the dataset from Kaggle as a zipped file, upload that to our Google Drive, and then mount Google Drive on Colab to access the uploaded dataset. 
* Use the [Kaggle API](https://github.com/Kaggle/kaggle-api) to directly download the dataset to our Colab Notebook environment. 

We will be using the second option. **Note** that you need to obtain your Kaggle API keys to perform this step. Follow the instructions [here](https://github.com/Kaggle/kaggle-api) in order to obtain your Kaggle API keys in case you don't have them. 

Now, navigate to the File Browser in Colab and upload your keys following [this screencast](https://www.loom.com/share/ca76bb983e0844e2a7f14b473d7287c6). After the keys file has been uploaded, we need to move it to the location where the Kaggle API expects to find it.

In [None]:
!mkdir -p ~/.kaggle
!cp /content/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
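
If the download step later fails with an authentication error, it can help to check the keys file first. The following is a minimal sketch, not part of the blog post: `check_kaggle_credentials` is a hypothetical helper, and it assumes the token is a JSON object with `username` and `key` fields stored at `~/.kaggle/kaggle.json`.

```python
import json
import os
import stat


def check_kaggle_credentials(path):
    """Return a list of problems found with a Kaggle credentials file."""
    if not os.path.isfile(path):
        return ["file not found: {}".format(path)]

    # assumption: the token is a JSON object with `username` and `key`
    with open(path) as f:
        try:
            creds = json.load(f)
        except ValueError:
            return ["file is not valid JSON"]
    if not isinstance(creds, dict):
        return ["file does not contain a JSON object"]

    problems = []
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append("missing field: {}".format(field))

    # mirror `chmod 600`: group and others should have no permissions
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append("file is accessible by other users")
    return problems


# an empty list means no problems were detected
print(check_kaggle_credentials(os.path.expanduser("~/.kaggle/kaggle.json")))
```

An empty list only means the file looks well-formed; it does not confirm that the key itself is still valid on Kaggle.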

Now, we can download the dataset. 

In [None]:
!kaggle competitions download -c dogs-vs-cats -f train.zip

To follow along with the rest of the code, please unzip the downloaded dataset into the `kaggle_dogs_vs_cats` directory by executing the commands below.

In [None]:
!mkdir kaggle_dogs_vs_cats
!unzip -qq train.zip -d kaggle_dogs_vs_cats

## Blog Post Code

### Import Packages

In [None]:
# import the necessary packages
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imutils import paths
import numpy as np
import argparse
import imutils
import cv2
import os

### Image classification using regularization with Python and scikit-learn

In [None]:
def extract_color_histogram(image, bins=(8, 8, 8)):
	# extract a 3D color histogram from the HSV color space using
	# the supplied number of `bins` per channel
	hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
	hist = cv2.calcHist([hsv], [0, 1, 2], None, bins,
		[0, 180, 0, 256, 0, 256])

	# handle normalizing the histogram if we are using OpenCV 2.4.X
	if imutils.is_cv2():
		hist = cv2.normalize(hist)

	# otherwise, perform "in place" normalization in OpenCV 3 (I
	# personally hate the way this is done)
	else:
		cv2.normalize(hist, hist)

	# return the flattened histogram as the feature vector
	return hist.flatten()
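
To see what this feature vector looks like without loading an image, here is a NumPy-only sketch of the same idea. It is a stand-in for illustration, not the OpenCV code above: `color_histogram_numpy` is a hypothetical name, the input is assumed to be already in HSV with OpenCV's 8-bit ranges, and the L2 normalization assumes the default behavior of `cv2.normalize`.

```python
import numpy as np


def color_histogram_numpy(pixels, bins=(8, 8, 8)):
    # `pixels` is an (H, W, 3) array assumed to be in HSV already, with
    # H in [0, 180) and S, V in [0, 256)
    flat = pixels.reshape(-1, 3).astype("float64")
    hist, _ = np.histogramdd(flat, bins=bins,
        range=[(0, 180), (0, 256), (0, 256)])

    # L2-normalize so the feature does not depend on the image size
    norm = np.linalg.norm(hist)
    if norm > 0:
        hist = hist / norm

    # 8 x 8 x 8 bins flatten into a 512-dimensional feature vector
    return hist.flatten()


# build a synthetic 32x32 "HSV image" from random values
rng = np.random.default_rng(0)
hsv = np.stack([
    rng.integers(0, 180, size=(32, 32)),
    rng.integers(0, 256, size=(32, 32)),
    rng.integers(0, 256, size=(32, 32)),
], axis=-1)

features = color_histogram_numpy(hsv)
print(features.shape)
```

Whatever the size of the input image, each one is described by a vector of the same length (the product of the bin counts), which is what lets us stack the histograms into a single data matrix.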

In [None]:
# construct the argument parser and parse the arguments
# ap = argparse.ArgumentParser()
# ap.add_argument("-d", "--dataset", required=True,
# 	help="path to input dataset")
# args = vars(ap.parse_args())

# since we are using Jupyter Notebooks we can replace our argument
# parsing code with *hard coded* arguments and values
args = {
	"dataset": "kaggle_dogs_vs_cats"
}

In [None]:
# grab the list of images that we'll be describing
print("[INFO] describing images...")
imagePaths = list(paths.list_images(args["dataset"]))

# initialize the data matrix and labels list
data = []
labels = []

In [None]:
# loop over the input images
for (i, imagePath) in enumerate(imagePaths):
	# load the image and extract the class label (assuming that our
	# path has the format: /path/to/dataset/{class}.{image_num}.jpg)
	image = cv2.imread(imagePath)
	label = imagePath.split(os.path.sep)[-1].split(".")[0]

	# extract a color histogram from the image, then update the
	# data matrix and labels list
	hist = extract_color_histogram(image)
	data.append(hist)
	labels.append(label)

	# show an update every 1,000 images
	if i > 0 and i % 1000 == 0:
		print("[INFO] processed {}/{}".format(i, len(imagePaths)))
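
The label extraction in the loop relies entirely on the filename convention. A short sketch of that step in isolation, using a hypothetical helper `label_from_path` and made-up example paths that follow the `{class}.{image_num}.jpg` pattern:

```python
import os


def label_from_path(image_path):
    # keep only the filename, then take the text before the first "."
    filename = os.path.basename(image_path)
    return filename.split(".")[0]


# example paths (illustrative only)
cat_path = os.path.join("kaggle_dogs_vs_cats", "train", "cat.0.jpg")
dog_path = os.path.join("kaggle_dogs_vs_cats", "train", "dog.12499.jpg")

print(label_from_path(cat_path))  # cat
print(label_from_path(dog_path))  # dog
```

If your files are named differently, this is the one place that needs to change.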

In [None]:
# encode the labels, converting them from strings to integers
le = LabelEncoder()
labels = le.fit_transform(labels)

# partition the data into training and testing splits, using 75%
# of the data for training and the remaining 25% for testing
print("[INFO] constructing training/testing split...")
(trainData, testData, trainLabels, testLabels) = train_test_split(
	np.array(data), labels, test_size=0.25, random_state=42)
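
To make the two steps above concrete, here is a NumPy-only sketch on a handful of toy labels. It is a stand-in rather than the scikit-learn code itself: `np.unique` plays the role of `LabelEncoder`, and a shuffled index split plays the role of `train_test_split`, so the exact rows that end up in each split will not match what scikit-learn produces.

```python
import numpy as np

# toy labels (illustrative only)
labels = np.array(["dog", "cat", "cat", "dog", "cat", "dog", "dog", "cat"])

# np.unique sorts the class names and returns each label's index into them
classes, encoded = np.unique(labels, return_inverse=True)

# shuffle the indices reproducibly, then hold out 25% for testing
rng = np.random.default_rng(42)
indices = rng.permutation(len(encoded))
num_test = int(np.ceil(0.25 * len(encoded)))
test_idx, train_idx = indices[:num_test], indices[num_test:]

print(classes, encoded)
print(len(train_idx), len(test_idx))
```

With eight samples, six go to training and two to testing, and every sample lands in exactly one of the two splits.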

In [None]:
# loop over our set of regularizers
for r in (None, "l1", "l2", "elasticnet"):
	# train a Stochastic Gradient Descent classifier using the log
	# (logistic regression) loss, the specified regularizer, and 10
	# epochs; scikit-learn 1.1+ names this loss "log_loss" (older
	# releases used "log")
	print("[INFO] training model with `{}` penalty".format(r))
	model = SGDClassifier(loss="log_loss", penalty=r, random_state=967,
		max_iter=10)
	model.fit(trainData, trainLabels)

	# evaluate the classifier
	acc = model.score(testData, testLabels)
	print("[INFO] `{}` penalty accuracy: {:.2f}%".format(r, acc * 100))
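
Each penalty in the loop adds a different term to the loss based on the size of the weight vector. The sketch below computes those terms by hand for a toy weight vector. The function names are hypothetical, and the `0.5` factor and the `l1_ratio=0.15` default follow the convention described in scikit-learn's SGD documentation as I understand it, so treat the exact constants as assumptions.

```python
import numpy as np


def l1_penalty(w):
    # sum of absolute weights; tends to push weights to exactly zero
    return np.sum(np.abs(w))


def l2_penalty(w):
    # half the sum of squared weights; shrinks large weights the most
    return 0.5 * np.sum(w ** 2)


def elasticnet_penalty(w, l1_ratio=0.15):
    # convex mix of the two penalties, controlled by `l1_ratio`
    return l1_ratio * l1_penalty(w) + (1.0 - l1_ratio) * l2_penalty(w)


# toy weight vector and regularization strength (illustrative values)
w = np.array([0.5, -2.0, 0.0, 1.5])
alpha = 0.0001

for name, penalty in (("l1", l1_penalty), ("l2", l2_penalty),
        ("elasticnet", elasticnet_penalty)):
    # the regularized objective is: data loss + alpha * penalty(w)
    print("{}: penalty={:.4f}, scaled={:.8f}".format(
        name, penalty(w), alpha * penalty(w)))
```

With `penalty=None`, this term is dropped entirely and only the data loss is minimized.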

For a detailed walkthrough of the concepts and code, be sure to refer to the full tutorial, [*Understanding regularization for image classification and machine learning*](https://www.pyimagesearch.com/2016/09/19/understanding-regularization-for-image-classification-and-machine-learning/) published on 2016-09-19.