# Searching for the category

For this code-along we are only going to use the products DataFrame. However, if you believe there is information in other tables that can help to create categories, please feel free to explore.

In [None]:
import pandas as pd

In [None]:
# products_cl.csv
url = "https://drive.google.com/file/d/1s7Lai4NSlsYjGEPg1QSOUJobNYVsZBOJ/view?usp=sharing" 
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]
products_cl = pd.read_csv(path)

In [None]:
product_category_df = products_cl.copy()

In [None]:
product_category_df.head()

## 1.&nbsp; Category creation by search term
Let's start by creating a column `category`. For now we'll fill this column with a blank string `""`.

In [None]:
product_category_df["category"] = ""
product_category_df.head()

We can find all the products with certain words in their description (the `desc` column) using `.loc[]` and `.str.contains()`. Here we'll look at all the items that have the word `keyboard` in their description.

In [None]:
product_category_df.loc[product_category_df["desc"].str.lower().str.contains("keyboard"), :]
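The same search can be tried on a small, made-up DataFrame to see exactly which rows the mask selects. This is only a sketch: `toy_df` and its rows are hypothetical stand-ins for the products table. The `case=False` and `na=False` arguments are an alternative to `.str.lower()` that also treats missing descriptions as non-matches, which may or may not be needed for your data.

```python
import pandas as pd

# Hypothetical toy data standing in for the products table.
toy_df = pd.DataFrame({
    "name": ["Magic Keyboard", "USB-C cable", "Mystery item"],
    "desc": ["Wireless KEYBOARD with numeric keypad", "Charging cable 1m", None],
})

# case=False ignores capitalisation; na=False counts a missing description as "no match".
keyboard_mask = toy_df["desc"].str.contains("keyboard", case=False, na=False)

# Keep only the rows where the mask is True, and all columns.
keyboard_rows = toy_df.loc[keyboard_mask, :]
print(keyboard_rows)
```

Only the first row is selected: the second description does not contain the word, and the third has no description at all.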

Next, we change the value in the category column to `keyboard` for all of these keyboard products. 

In [None]:
product_category_df.loc[product_category_df["desc"].str.lower().str.contains("keyboard"), "category"] = "keyboard"

Let's take a look at the effect that had on the `category` column.

In [None]:
product_category_df["category"].value_counts()

## 2.&nbsp; Category creation using regex
We can also use a product's `name` to select products for our categories.

In [None]:
product_category_df.loc[product_category_df["name"].str.lower().str.contains("apple iphone"), :]

Looks like we get a lot of accessories included in this search. We can refine this using a little regex. Here, we will add `^.{0,7}` at the beginning of the search term: `^` anchors the match to the start of the name, and `.{0,7}` allows between 0 and 7 characters of any kind. Together they mean we only find names where 7 or fewer characters precede the term "apple iphone". If 8 or more characters precede the search term, the name won't be found. This should help refine our search by using the naming conventions of the DataFrame to our advantage.

If you feel unsure about regex, please use [regex101](https://regex101.com/). It's really useful for checking your code, and parts of other people's code that you're unsure about.

In [None]:
product_category_df.loc[product_category_df["name"].str.lower().str.contains("^.{0,7}apple iphone"), :]
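To see why the `^` anchor matters, here is a small sketch on three invented product names (they are illustrations, not rows from the real table). Without `^`, `.str.contains()` is free to start matching anywhere in the string, so the 7-character limit has no effect.

```python
import pandas as pd

# Hypothetical product names, already lower-cased by .str.lower() below.
toy_names = pd.Series([
    "Apple iPhone 7 32GB Black",     # 0 characters before "apple iphone"
    "Open - Apple iPhone 6 16GB",    # "open - " is exactly 7 characters
    "Silicone Case Apple iPhone 7",  # 14 characters before "apple iphone"
])

lowered = toy_names.str.lower()

# Anchored: at most 7 characters may sit between the start of the name and the term.
anchored = lowered.str.contains(r"^.{0,7}apple iphone")

# Unanchored: the pattern can begin anywhere, so every name containing the term matches.
unanchored = lowered.str.contains(r".{0,7}apple iphone")

print(pd.DataFrame({"name": toy_names, "anchored": anchored, "unanchored": unanchored}))
```

The anchored search drops the case, while the unanchored one keeps all three names.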

Now we can use the same trick as before to set the category - selecting the `category` column and setting it to the string of our choice.

In [None]:
product_category_df.loc[product_category_df["name"].str.lower().str.contains("^.{0,7}apple iphone"), "category"] = "smartphone"

In [None]:
product_category_df["category"].value_counts()

## 3.&nbsp; One product with multiple categories
A product may fit into multiple categories. To help us assign multiple categories to one product, we will use Python's addition assignment operator `+=`. Addition assignment is a shorthand way to add something (a number, a string, etc.) to a variable and store the result back in the same variable.

Let's have a look at a couple of examples.

In [None]:
a = 10
a = a + 5
a

In [None]:
a = 10
a += 5
a

In [None]:
b = "Tyrannosaurus"
b = b + " rex"
b

In [None]:
b = "Tyrannosaurus"
b += " rex"
b

Now let's look at how this can help us in our category creation.

First, we'll reset all the values in the category column to an empty string `""`.

In [None]:
product_category_df["category"] = ""

Now, let's create some categories and utilise the addition assignment.

In [None]:
product_category_df.loc[product_category_df["desc"].str.lower().str.contains("keyboard"), "category"] += ", keyboard"
product_category_df.loc[product_category_df["name"].str.lower().str.contains("^.{0,7}apple iphone"), "category"] += ", smartphone"
product_category_df.loc[product_category_df["name"].str.lower().str.contains("^.{0,7}apple ipod"), "category"] += ", ipod"
product_category_df.loc[product_category_df["name"].str.lower().str.contains("^.{0,7}apple ipad|tablet"), "category"] += ", tablet"
product_category_df.loc[product_category_df["name"].str.lower().str.contains("imac|mac mini|mac pro"), "category"] += ", desktop"

In [None]:
product_category_df["category"].value_counts()
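The effect of `+=` on a `.loc[]` selection is easier to follow on a handful of rows. The sketch below uses a made-up `toy_df`; the product names and descriptions are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: the second product should end up in two categories.
toy_df = pd.DataFrame({
    "name": ["Apple iPhone 7", "Keyboard case for tablet", "HDMI cable"],
    "desc": ["Smartphone 32GB", "Folio with bluetooth keyboard", "2m cable"],
})
toy_df["category"] = ""

is_keyboard = toy_df["desc"].str.lower().str.contains("keyboard")
is_tablet = toy_df["name"].str.lower().str.contains("tablet")

# Each += appends to whatever is already in the selected cells,
# so rows matched by both masks collect both labels.
toy_df.loc[is_keyboard, "category"] += ", keyboard"
toy_df.loc[is_tablet, "category"] += ", tablet"

print(toy_df[["name", "category"]])
```

Rows that match neither mask keep their empty string, and the row that matches both ends up with `", keyboard, tablet"`.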

As you can see, some products now have 2 categories instead of just one. At the end, you can use your string skills to tidy up the leading comma and space in the `category` column.
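One possible way to do that tidy-up is a regex replacement anchored to the start of the string. This is a sketch on a hypothetical Series, not the only option. Note that `.str.lstrip(", ")` would also work here, but it strips any run of commas and spaces rather than the exact prefix.

```python
import pandas as pd

# Hypothetical category strings as they look after using +=.
categories = pd.Series([", keyboard", ", keyboard, tablet", ""])

# ^ anchors the pattern to the start, so only the leading ", " is removed;
# the separator between two categories is left alone.
tidy = categories.str.replace(r"^, ", "", regex=True)

print(tidy.tolist())
# ['keyboard', 'keyboard, tablet', '']
```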

# Challenge. Your categories
Now it's your turn. We'll reset the DataFrame so that no categories exist, and it's up to you to create the categories based on keywords in the name and description. Feel free to go wild and make as many categories as you like.
* Remember you can also use regex to refine your searches.
* Remember you can use the or operator `|` to search for multiple terms at once.
* Remember to tidy up any untidy strings at the end.
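A note on combining the first two tips: in regex, `|` has the lowest precedence, so a pattern such as `^.{0,7}apple ipad|tablet` reads as "`^.{0,7}apple ipad` or `tablet` anywhere". The sketch below, on invented names, contrasts this with a grouped version in which the prefix limit applies to both terms. Which behaviour you want depends on your categories.

```python
import pandas as pd

# Hypothetical, already lower-cased product names.
toy_names = pd.Series([
    "apple ipad pro 10.5",
    "sleeve for apple ipad mini 4",  # 11 characters before "apple ipad"
    "wacom graphics tablet",         # 15 characters before "tablet"
])

# Ungrouped: the prefix limit applies only to "apple ipad"; "tablet" matches anywhere.
ungrouped = toy_names.str.contains(r"^.{0,7}apple ipad|tablet")

# Grouped with a non-capturing group: the prefix limit applies to both terms.
grouped = toy_names.str.contains(r"^.{0,7}(?:apple ipad|tablet)")

print(pd.DataFrame({"name": toy_names, "ungrouped": ungrouped, "grouped": grouped}))
```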

In [None]:
# your code here

## 4.&nbsp; [BONUS] Using `type` to create categories
There could be another way to create categories, but you'll have to explore this one alone.

We have the mysterious column `type` in the `products` table. This could potentially be ready-made categories labelled with numbers instead of words. Let's investigate.

In [None]:
category_type_df = products_cl.copy()

Here are the `type`s that have the most products.

In [None]:
category_type_df.groupby("type").count().nlargest(10, "sku")

Let's have a look at the first `type` to see if we can make categories from this column.

In [None]:
category_type_df.loc[category_type_df["type"] == "11865403", :].sample(10)

Looks like this is a category of phone cases.

Let's have a look at the 2nd largest type to see if that's also a clear category.

In [None]:
category_type_df.loc[category_type_df["type"] == "12175397", :].sample(10)

Looks like this category is full of servers.

I wonder how many `type`s account for most of our products?

In [None]:
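n = 30
# Number of products that fall into the n largest types.
top_n_count = category_type_df.groupby("type").count().nlargest(n, "sku")["sku"].sum()
# Share of all products covered by those types, as a percentage.
share = round(top_n_count / category_type_df.shape[0] * 100, 2)
print(f"With the {n} largest types, we account for {share}% of all products.")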
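The same coverage calculation can be checked by hand on a tiny example. The sketch below uses `value_counts()` as an alternative to `groupby().count()`; the `sku` and `type` values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: 10 products spread over 5 types.
toy_df = pd.DataFrame({
    "sku": ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "A10"],
    "type": ["100", "100", "100", "100", "200", "200", "200", "300", "400", "500"],
})

n = 2

# Products per type, largest first.
type_counts = toy_df["type"].value_counts()

# Products covered by the n largest types (4 + 3 = 7 here).
covered = type_counts.nlargest(n).sum()

share = round(covered / len(toy_df) * 100, 2)
print(f"With the {n} largest types, we account for {share}% of all products.")
```

With 7 of the 10 toy products in the two largest types, the share comes out at 70%.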

Looks like we can simply investigate 30 types and set the categories, then the remaining 20% of products can have the category `other`.

Use the skills you learnt above to change the category for each type.
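As a starting point, one possible approach is to keep the type-to-category pairs in a dictionary and apply them with `.map()`. This is a sketch under assumptions: the two labels come from the samples inspected above, the type `"99999999"` is made up, and the types are assumed to be stored as strings, as in the comparisons earlier in this section.

```python
import pandas as pd

# Hypothetical toy data; "99999999" is an invented type with no category assigned.
toy_df = pd.DataFrame({
    "sku": ["A1", "A2", "A3"],
    "type": ["11865403", "12175397", "99999999"],
})

# Labels taken from inspecting samples of each type; extend this as you explore more types.
type_to_category = {
    "11865403": "phone case",
    "12175397": "server",
}

# .map() returns NaN for types missing from the dictionary; fillna turns those into "other".
toy_df["category"] = toy_df["type"].map(type_to_category).fillna("other")

print(toy_df)
```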