# Semantic Comparison and String Clustering

Remember that at the start of this whole exercise we learned that we are looking at a series of search queries, probably made against a search engine or something similar.

In this notebook, we're going to walk through a few techniques to compare those strings so that we can get a better understanding of which of these are similar and for which reasons.

Semantic comparison and string matching can be a little tricky to understand, so I'll try to explain each of the concepts thoroughly.

In [None]:
import pandas as pd

# If you need help importing data or understanding what is happening in this cell, check out this notebook: ~/uploading-and-inspecting-data/notebook.ipynb

file_path = '../data/data.csv'

df = pd.read_csv(file_path)

df.head()

Unnamed: 0,search_query,time,platform
0,vacation spots recipe,1722760240,mobile
1,what is invest in crypto,1734252942,mobile
2,AI tools recipe,1728010297,desktop
3,best camera symptoms,1730697978,mobile
4,best meditation near me,1735175397,mobile


## Having a closer look

Now that we've created our DataFrame, we need to normalize all of this text data so that we can compare rows more reliably. Extra characters, strange (or even typical) capitalization, stray whitespace, and plenty of other details that seem innocuous to a human reader can have a large impact on a machine's ability to compare two strings.

In [6]:
# The first thing we'll do is convert all of our queries to lower case
queries = df['search_query'].str.lower()

# The next thing we'll do is remove any leading or trailing whitespace
queries = queries.str.strip()

queries.head()

0       vacation spots recipe
1    what is invest in crypto
2             ai tools recipe
3        best camera symptoms
4     best meditation near me
Name: search_query, dtype: object
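
The cell above handles capitalization and leading or trailing whitespace. If your data also contains extra characters such as punctuation or repeated spaces, a minimal sketch of one way to handle them is shown below. The sample strings are hypothetical and are not taken from the dataset, and replacing punctuation with a space is just one possible choice.

```python
import pandas as pd

# Hypothetical sample queries, not taken from the dataset
raw = pd.Series(["  AI   Tools  recipe!! ", "Best-Camera symptoms?", "vacation spots recipe"])

cleaned = (
    raw.str.lower()
       .str.replace(r"[^\w\s]", " ", regex=True)  # replace punctuation with a space
       .str.replace(r"\s+", " ", regex=True)      # collapse runs of whitespace
       .str.strip()                               # drop leading/trailing whitespace
)

print(cleaned.tolist())
```

Whether punctuation should be removed at all depends on your data, so treat this as a starting point rather than a rule.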

## Moving on...

We can see above that we've changed words like "AI" to "ai". By ensuring that our data is normalized, we'll make it a lot easier to compare strings that may be semantically very similar, but structurally slightly different.

Your dataset may already be normalized, in which case you might be able to skip this step. Unless you want to look through your entire, potentially quite large, dataset to be sure, though, I always recommend running the above cell just in case. It takes way less time than looking through the entire dataset would!

Okay, now that we've normalized all of our strings, we need to convert them to a language that machines can understand, and what better language than **numbers** for that purpose?!

For now, it's enough to know that `TfidfVectorizer` is a class that converts your human-readable strings into a number-based matrix that can be used for comparison. What's happening under the hood is actually quite interesting as well though, so make sure you read the [`sklearn` documentation](https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py) to learn more!

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(queries)


## A word about stop words

Before we move on with the comparison, let's talk about stop words.

Stop words are very common words (in English, things like `a`, `and`, or `or`) that are usually removed before text-based comparisons because they can skew the results. For example, the string `lions and tigers and bears` looks far less similar to the string `happy and joyful and excited` once we remove the stop word `and`, which doesn't actually carry much relevant meaning for our purposes here.
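
The effect described above can be checked with a short sketch. The helper `pair_similarity` is a hypothetical function written for this illustration, and cosine similarity on TF-IDF vectors is assumed as the comparison measure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pair = ["lions and tigers and bears", "happy and joyful and excited"]

def pair_similarity(texts, stop_words=None):
    """Cosine similarity between the first two texts after TF-IDF vectorization."""
    matrix = TfidfVectorizer(stop_words=stop_words).fit_transform(texts)
    return float(cosine_similarity(matrix[0], matrix[1])[0, 0])

with_and = pair_similarity(pair)                           # "and" is kept
without_and = pair_similarity(pair, stop_words="english")  # "and" is removed

print(f"keeping stop words:  {with_and:.2f}")
print(f"removing stop words: {without_and:.2f}")
```

With `and` kept, the two strings share a word and score above zero. With it removed, they have nothing in common.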

That said, this is an English-language tutorial, and we all know that data science is practiced in nearly every language on the globe. If you're using `sklearn` whilst analyzing data in another language (or if you have a domain-specific dataset that contains certain words which occur over and over again but don't add any semantic value to the underlying data), adding a custom dictionary of stop words is a great way to improve your analysis.

In [None]:
# First, let's have a look at the stopwords contained in the default `english` parameter

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# We only show the first 10 because the list of stop words is quite long
print(sorted(ENGLISH_STOP_WORDS)[:10])
print('Total stop words:')
print(len(ENGLISH_STOP_WORDS))

['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost']
Total stop words:
318


## Creating a custom stopword dictionary

Let's consider a case where we're going to be working with a Russian-language dataset. If we remove English stopwords alone, we're going to run into the same matching issues discussed above, but this time for the Russian language.

This part of the tutorial is going to focus on creating a custom stopword dictionary and using it instead of the default `english` dictionary.

(If you'd just like to extend the default `english` stop words, have a look at this [Stack Overflow question](https://stackoverflow.com/questions/24386489/adding-words-to-scikit-learns-countvectorizers-stop-list) which describes doing just that!)

In [18]:
# First, let's define our custom stopwords.
custom_stop_words = [
    'в', 'на', 'по', 'о', 'об', 'про', 'при', 'за', 'под', 'над', 'с', 'без', 'до', 'после', 'из', 'от', 'к', 'у', 'через',
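    # Stop words are matched one token at a time, so multi-word phrases are listed as single words
    'и', 'но', 'или', 'а', 'да', 'потому', 'что', 'так', 'как', 'если', 'когда', 'чтобы',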
    'не', 'ни', 'же', 'ведь', 'уж', 'бы', 'ли', 'разве', 'неужели',
    'я', 'ты', 'он', 'она', 'оно', 'мы', 'вы', 'они', 'свой', 'наш', 'ваш',
    'весь', 'сам', 'самый', 'каждый', 'любой', 'оба', 'тоже', 'также',
    'это', 'там', 'тут', 'тогда', 'сейчас', 'всегда', 'никогда', 'часто',
    'конечно', 'может', 'должен', 'можно', 'нужно',
    'вся', 'сама', 'самая', 'каждая', 'любая', 'обе', 'эта', 'должна'
]