
# Lesson 88: Internet Advertisement Classification

### Teacher-Student Activities

In the previous classes, you learnt to classify images by building a support vector machine model. In this class, you will solve another problem statement based on binary classification using SVM.

Companies publish advertisements on social media platforms, so the images that internet users see there are either advertisement images or something else.

From a survey of such images, a dataset was created in which advertisement images and other images are identified. We need to build a classification model that can distinguish advertisement images from non-advertisement images.

Let's go through the dataset description for this problem statement.

**Data Description**

There are 1559 columns in the data. Each row in the data represents one image that is tagged as `ad.` or `nonad.` in the last column. Here's the description of each column:

- **`Unnamed: 0`**: Unique ID of each image
- **`0`**: Height of an image
- **`1`**: Width of an image
- **`2`**: Aspect ratio (ratio of width to the height) of an image
- Columns **`3`** to **`1557`**: Pixel values of an image
- **`1558`**: Whether the image belongs to an advertisement or not

**Dataset credits:** *https://archive.ics.uci.edu/ml/datasets/Internet+Advertisements*

**Citation:**

Dua, D., & Graff, C. (2017). UCI Machine Learning Repository.






---

#### Activity 1: Loading Dataset

Let's import all the required Python modules and load the dataset. Here's the link to the dataset:

https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/ad-or-nonad.csv

In [1]:
# S1.1: Import all the required Python modules and load the dataset.


Before going ahead, let's drop the `Unnamed: 0` column from the data frame because we would never need it to build a classification model using support vector machines.

In [2]:
# S1.2: Drop the 'Unnamed: 0' column from the data frame.


Now, let's get some information about the dataset such as the number of rows and columns, any missing values and the type of values in each column.

In [3]:
# S1.3: Get information about the dataset.


As you can see, there are 3279 rows and 1559 columns in the dataset. Five columns have non-numeric values; all the remaining columns have numeric values.

Now, let's rename the columns
- `0` as `height`
- `1` as `width`
- `2` as `aspect ratio` and
- `1558` as `target`

because that is what these columns represent.

To rename the columns of a Pandas DataFrame, call the `rename()` function on a `DataFrame` object. Inside the function, pass a Python dictionary containing the old and new column names as key-value pairs to the `columns` parameter. Without `columns` (or `axis=1`), the dictionary is applied to the row labels instead of the column names. Note that `rename()` returns a new data frame unless `inplace=True` is passed.

**Syntax:** `data_frame.rename(columns={old_col_name1 : new_col_name1, old_col_name2 : new_col_name2, ... old_col_nameN : new_col_nameN})`

where

- `data_frame` is a Pandas `DataFrame` object and

- `old_col_name1 : new_col_name1, old_col_name2 : new_col_name2, ... old_col_nameN : new_col_nameN` are key-value pairs in a Python dictionary.
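
As a minimal sketch of this syntax, the toy data frame below renames three columns. The values in `toy_df` are invented for illustration and are not taken from the dataset. Passing the dictionary through the `columns` parameter makes pandas apply it to column labels rather than row labels.

```python
import pandas as pd

# Toy data frame; the values are made up for illustration only.
toy_df = pd.DataFrame({"0": [125, 57], "1": [125, 468], "1558": ["ad.", "nonad."]})

# Pass the mapping through `columns` so that column labels (not row labels) are renamed.
renamed_df = toy_df.rename(columns={"0": "height", "1": "width", "1558": "target"})

print(list(renamed_df.columns))
```

Because `rename()` returns a new data frame by default, `toy_df` keeps its original column names here; assign the result back to the same variable if you want to keep working with the renamed columns.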

In [4]:
# S1.4: Rename the columns '0', '1', '2' and '1558' of the data frame with their correct names as stated above.


As you can see, the column names are changed as we required. Now let's check for the missing values in each column in the data frame.

In [5]:
# S1.5: Check for the missing values in each column in the data frame.


Since there are 1559 columns to check, let's call the `sum()` function again on the Pandas series returned by the first `sum()` call to get the total number of missing values across all the columns. If there are no missing values, the total is 0; otherwise, it is greater than 0.
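
The chained `sum()` calls can be sketched on a small example. This sketch assumes that `isnull()` is used to flag the missing values; the toy data frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data frame with one missing value in each column (illustrative values only).
toy_df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

# First sum(): number of missing values in each column (a Pandas series).
missing_per_column = toy_df.isnull().sum()

# Second sum(): total number of missing values across all the columns.
total_missing = missing_per_column.sum()

print(missing_per_column)
print(total_missing)
```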

In [6]:
# S1.6: Calculate the total sum of the missing (or null) values in all the columns in the data frame.


So there are no missing values in the data frame.

---

#### Activity 2: The `pandas.to_numeric` Function

Now, let's convert the numeric values reported as non-numeric values to numeric values in the data frame in all the columns except in the `target` column.

For this, you can use the `to_numeric` function of the Pandas module. To use this function, you need to

1. Call the `apply()` function on the `DataFrame` object.

2. Inside the `apply()` function, pass `pandas.to_numeric` as an input.

**Note:** This exercise will throw `ValueError` because a few columns contain some unwanted values.
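
To see why the conversion can fail, here is a small sketch on a hypothetical data frame (the values are invented). A column whose strings are all valid numbers converts cleanly, whereas a single non-numeric string makes `pandas.to_numeric` raise `ValueError`.

```python
import pandas as pd

# Toy data frame: 'height' contains one non-numeric string, 'width' does not.
toy_df = pd.DataFrame({
    "height": ["125", "   ?", "60"],
    "width": ["125", "468", "468"],
})

# Converting a clean column works.
clean_width = pd.to_numeric(toy_df["width"])
print(clean_width.tolist())

# Converting every column fails because of the "   ?" entry in 'height'.
try:
    toy_df.apply(pd.to_numeric)
    conversion_failed = False
except ValueError as error:
    conversion_failed = True
    print("ValueError:", error)
```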

In [None]:
# S2.1: Convert the numeric values reported as non-numeric values to numeric values.
new_df = df[df.columns[:-1]].apply(pd.to_numeric)
new_df.info()

When we tried to convert the values in the columns to numeric types, we got a `ValueError`. As the error message shows, this is because there is an unwanted value at position 10 of a column. We need to remove this unwanted value wherever it is present in the data frame. To do this:

1. First, we need to create a function that can list out all the columns containing the unwanted value (here, `"   ?"`).

2. Then, we will remove/replace the unwanted values from each column separately.



In [None]:
# S2.2: Create a function that can list out all the columns containing the unwanted value.
def unwanted_values_finder(string):
  unwanted_value_col = []
  for col in df.columns:
    if np.sum(df[col]==string) > 0:
      unwanted_value_col.append(col)
  return unwanted_value_col
# Now, use the function created above to find the columns containing the "   ?" value.
unwanted_values_finder("   ?")

So the columns `'height'` and `'width'` contain the `"   ?"` values.

Let's create a data frame wherein the values in the `height` column are `"   ?"` only.

In [None]:
# S2.3: Create a data frame wherein the values in the height column are "   ?" only.


There are `903` rows in the above data frame.

Here we can see that the columns `height`, `width` and `aspect ratio` all seem to contain the question mark as an unwanted value. However, when we searched for the question mark in all the columns, the function that we created returned only the `height` and `width` columns, not the `aspect ratio` column. **Can you think of a reason why this would happen?** Think about this question for a couple of minutes and try to find the reason.

Let's apply the `pandas.to_numeric()` function only on the `aspect ratio` column and find out whether we get the `ValueError` message again. If we do, let's note the unwanted value that it reports.

In [None]:
# S2.4: Apply the 'pandas.to_numeric()' function only on the 'aspect ratio' column


Once again, we got the `ValueError` message. This time, however, the unwanted value is a question mark preceded by more spaces. Let's use the unwanted values finder function again to list out the columns containing the question mark with more spaces.
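
The number of spaces matters because `==` compares whole strings exactly. The sketch below illustrates this on a hypothetical series. The last step, which strips whitespace with `str.strip()` before comparing, is an alternative shown only for comparison and is not the approach followed in this lesson.

```python
import pandas as pd

# Toy series with two differently padded question marks (illustrative values only).
toy_series = pd.Series(["   ?", "     ?", "1.0"])

# Exact comparison: each pattern matches only the string with the same padding.
three_space_matches = int((toy_series == "   ?").sum())
five_space_matches = int((toy_series == "     ?").sum())

# Alternative: strip the surrounding whitespace first so both variants match "?".
stripped_matches = int((toy_series.str.strip() == "?").sum())

print(three_space_matches, five_space_matches, stripped_matches)
```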

In [None]:
# S2.5: Use the unwanted values finder function to list out the columns containing the question mark with more spaces i.e., "     ?"


So only the `aspect ratio` column contains the question mark with more spaces.

Let's create a data frame wherein the values in the `width` column are `"   ?"` only.

In [None]:
# S2.6: Create a data frame wherein the values in the 'width' column are "   ?" only.


The above data frame contains 901 rows.

Here too the columns `height, width` and `aspect ratio` seem to have the question mark as the unwanted values.

Let's create a data frame wherein both the `height` and `width` columns have the question mark as unwanted values.

In [None]:
# S2.7: Create a new data frame wherein both the 'height' and 'width' columns have the question mark as unwanted values.


This data frame contains 894 rows wherein both the `height` and `width` columns have the question mark as unwanted values.

Let's find out the number of the images classified as `ad.` and `nonad.` in the above data frame.

In [None]:
# S2.8: Calculate the number of the images classified as 'ad.' and 'nonad.' in the above data frame.


Most of the images having unwanted values are non-advertisement images; very few are advertisement images compared to the total number of images in the data frame.

Let's calculate the percentage of such values.
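
One way to compute such percentages is sketched below on a hypothetical data frame. The column names mirror the lesson, but the values and the variable names `toy_df`, `unwanted_df` and `percent_unwanted` are invented for illustration.

```python
import pandas as pd

# Toy data frame (illustrative values only).
toy_df = pd.DataFrame({
    "height": ["   ?", "60", "   ?", "90", "   ?", "33"],
    "target": ["nonad.", "nonad.", "ad.", "ad.", "nonad.", "nonad."],
})

# Rows whose 'height' is the unwanted value.
unwanted_df = toy_df[toy_df["height"] == "   ?"]

# Class counts in the full data frame and in the unwanted subset.
total_counts = toy_df["target"].value_counts()
unwanted_counts = unwanted_df["target"].value_counts()

# Percentage of each class that would be lost; the division aligns on the class labels.
percent_unwanted = unwanted_counts * 100 / total_counts
print(percent_unwanted)
```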

In [None]:
# S2.9: Calculate the percentage of the unwanted values in non-advertisement and advertisement images.

So if we remove all the rows from the primary data frame wherein both the `'height'` and `'width'` columns have the question mark as unwanted values, we will lose about 29% of the images marked as `nonad.` and about 16% of the images marked as `ad.` Unfortunately, we have to remove all these rows because there is no way to retrieve the dimensions of the images.

---

#### Activity 3: Removing Rows from Pandas DataFrame

To **remove** all the rows containing question marks in both `height` and `width` columns:

1. Retrieve all the rows containing question marks in both `height` and `width` columns

2. Use the tilde (`~`) symbol to reverse (or conjugate or negate) the conditional statement used to retrieve all the rows containing question marks in both `height` and `width` columns.
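
The two steps above can be sketched on a small hypothetical data frame (the values are invented for illustration): build the boolean condition first, then negate it with `~` to keep every row that does not satisfy it.

```python
import pandas as pd

# Toy data frame (illustrative values only).
toy_df = pd.DataFrame({
    "height": ["   ?", "60", "   ?", "90"],
    "width": ["   ?", "468", "120", "   ?"],
})

# Step 1: condition that is True where BOTH columns hold the unwanted value.
both_missing = (toy_df["height"] == "   ?") & (toy_df["width"] == "   ?")

# Step 2: negate the condition with ~ to keep all the other rows.
kept_df = toy_df[~both_missing]
print(kept_df)
```

Only the first row is dropped here; rows with a question mark in just one of the two columns are kept.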

In [None]:
# S3.1: Remove all the rows containing question marks in both height and width columns
df = df[~((df["height"]=="   ?")&(df["width"]=="   ?"))]
df

After removing the required rows, the data frame now has 2385 rows.

Now let's check whether there are still more rows containing the question mark in the `height` column.

In [None]:
# S3.2: Check whether there are still more rows containing the question mark in the height column.


There are 9 rows containing question marks in the `height` column. In the same rows, the `aspect ratio` column also contains unwanted values.

Let's calculate the count of the non-advertisement and advertisement images in the above 9 rows.

In [None]:
# S3.3: Calculate the count of the non-advertisement and advertisement images in the above 9 rows.


All the above 9 rows belong to the non-advertisement images. Let's remove these 9 rows as well because two (`height, aspect ratio`) out of three (`height, width, aspect ratio`) values are not available, and one value alone is not enough to recover the other two. Had any two out of the three values been available, we could have calculated the third because

\begin{aligned}
\text{Aspect ratio} = \frac{\text{width}}{\text{height}}
\end{aligned}
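
As a quick numerical illustration of this relation, the sketch below recovers each dimension from the other two values. The numbers are invented and are not taken from the dataset.

```python
# Illustrative dimensions (not from the dataset).
width = 120.0
height = 60.0

# Aspect ratio = width / height
aspect_ratio = width / height

# Rearranging the same relation recovers a missing dimension from the other two values.
recovered_height = width / aspect_ratio
recovered_width = aspect_ratio * height

print(aspect_ratio, recovered_height, recovered_width)
```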

Now remove all the above 9 rows. You don't have to use the tilde (`~`) symbol this time.

In [None]:
# S3.4: Remove the above 9 rows containing unwanted values in both 'height' and 'aspect ratio' columns.


**Note:** The `!=` operator in Python is the same as the $\neq$ symbol in mathematics.
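
As a minimal sketch of filtering with `!=` instead of the tilde, the hypothetical data frame below (with invented values) keeps only the rows whose `height` is not the unwanted value.

```python
import pandas as pd

# Toy data frame (illustrative values only).
toy_df = pd.DataFrame({
    "height": ["   ?", "60", "90"],
    "target": ["nonad.", "ad.", "nonad."],
})

# Keep the rows where 'height' is NOT equal to the unwanted value.
kept_df = toy_df[toy_df["height"] != "   ?"]
print(kept_df)
```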

Now we are left with 2376 rows in the data frame.

Let's check whether there are still more rows containing the question mark in the `width` column.

In [None]:
# S3.5: Check whether there are still more rows containing the question mark in the width column.


There are 7 rows containing question marks in the `width` column. In the same rows, the `aspect ratio` column also contains unwanted values.

Let's calculate the count of the non-advertisement and advertisement images in the above 7 rows.

In [None]:
# S3.6: Calculate the count of the non-advertisement and advertisement images in the above 7 rows.


All the above 7 rows belong to the non-advertisement images. Let's remove these 7 rows as well because two (`width, aspect ratio`) out of three (`height, width, aspect ratio`) values are not available, and one value alone is not enough to recover the other two. Had any two out of the three values been available, we could have calculated the third because

\begin{aligned}
\text{Aspect ratio} = \frac{\text{width}}{\text{height}}
\end{aligned}

Now remove all the above 7 rows. Again, you won't have to use the tilde (`~`) symbol.

In [None]:
# S3.7: Remove the above 7 rows containing unwanted values in both 'width' and 'aspect ratio' columns.


We are left with 2369 rows in the data frame.

Now let's check whether there are still more rows containing the question mark in the `aspect ratio` column.

**Note:** There are more spaces before the question mark in the `aspect ratio` column.

In [None]:
# S3.8: Check whether there are still more rows containing the question mark in the aspect ratio column.


All the rows containing the unwanted values are removed.

Now let's try to convert the numeric values reported as the non-numeric values to numeric values in all the columns except in the `target` column.

In [None]:
# S3.9: Convert the numeric values reported as the non-numeric values to numeric values in all the columns except in the target column.


Once again, we have found a question mark as an unwanted value. This time, it is not preceded by any spaces.

Let's list out the columns containing `"?"` values.

In [None]:
# S3.10: List out the columns containing "?" values.


So only the column titled `'3'` has unwanted values. Let's create a data frame for the same.

In [None]:
# S3.11: Create a data frame containing only "?" in the column titled '3'.


There are 10 rows containing unwanted values in the column titled `'3'`. Let's remove these rows as well.

In [None]:
# S3.12: Remove the rows containing unwanted values in the column titled '3'.


Initially, we had 3279 rows and now we are left with 2359 rows, i.e., approximately 72% of the initial data frame. This is acceptable: we still have a significant chunk of the data left after removing all the rows containing unwanted values.

Now, let's again try to convert the numeric values reported as the non-numeric values to numeric values in all the columns except in the `target` column.

In [None]:
# S3.13: Convert the numeric values reported as the non-numeric values to numeric values in all the columns except in the target column.




Finally, we have converted all the required values to numeric types. You can verify this by calling the `describe()` function on the newly obtained data frame.

In [None]:
# S3.14: Call the 'describe()' function on the numeric-type columns of the data frame obtained above.


Let's stop here. In the next class, we will build an SVC model to classify the images as non-advertisement and advertisement images.

---

### **Project**
You can now attempt the **Applied Tech. Project 88 - Internet Advertisement Classification** on your own.

**Applied Tech. Project 88 - Internet Advertisement Classification**: https://colab.research.google.com/drive/1mHAUxfohlSzh7RxoDgrnlbM9oNiMQCer

---