# 1. Keyword De-duplication + 2. Comparing Keyword Lists

## Learning Outcomes

- Learn how to de-duplicate keyword lists using set operations
- Learn how to compare two lists of keywords using set operations
- Learn how to de-duplicate keyword lists using FuzzyWuzzy
- Learn how to create groups of keywords using FuzzyWuzzy
- Learn how to automatically de-duplicate + group keywords from a .CSV file exported from Ahrefs
- Learn how to use <strong>.lower()</strong> to improve de-duplication
- Learn how to use <strong>NLP lemmatization</strong> to improve de-duplication

--------------------------------------------------------------------------------

### De-duplicating Keywords With Set Operations

In [1]:
keyword_list_example = ['digital marketing', 'digital marketing', 'digital marketing services', 
                       'copywriting', 'seo copywriting', 'social media marketing', 'social media',
                       'digital marketing services']

Set operations allow us to easily de-duplicate a Python list which contains exact duplicates like so:

In [5]:
de_duplicated_set = set(keyword_list_example)

We can also transform the set back into a Python list and assign it to the original variable, keyword_list_example.

In [17]:
keyword_list_example = list(de_duplicated_set)

In [19]:
print(keyword_list_example)

['seo copywriting', 'digital marketing services', 'copywriting', 'social media', 'social media marketing', 'digital marketing']
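Notice that the printed order differs from the original list: sets are unordered, so the round trip through set() does not preserve the order in which keywords first appeared. If order matters, one common alternative is dict.fromkeys(), which relies on dictionaries keeping insertion order (Python 3.7+). The sketch below also lower-cases and strips each keyword first, on the assumption that 'SEO' and 'seo' should count as the same keyword. The list `raw_keywords` and the helper `normalise` are made up for illustration.

```python
def normalise(keyword):
    """Lower-case a keyword and trim surrounding whitespace."""
    return keyword.strip().lower()

# Hypothetical input containing case and whitespace variations
raw_keywords = ['Digital Marketing', 'digital marketing ', 'SEO', 'seo', 'copywriting']

# dict.fromkeys keeps the first occurrence of each key, in insertion order
ordered_unique_keywords = list(dict.fromkeys(normalise(k) for k in raw_keywords))
print(ordered_unique_keywords)
# ['digital marketing', 'seo', 'copywriting']
```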


The benefit of set-based de-duplication is that it is natively supported within Python; however, it only removes exact duplicates and doesn't capture partial matches. That's where <strong>FuzzyWuzzy comes in!</strong>

--------------------------------------------------

### How to Compare Two Keyword Lists With Set Operations

We might also want to compare two keyword lists to find where they overlap. For example, given Google Search Console data and paid search data, we might want to find the following:

- Items that occur within both lists
- Items that occur in list_a but not in list_b
- Items that occur in list_b but not in list_a
- All of the items
- Are all of the items in list_a in list_b (boolean - True / False)?
- Are all of the items in list_b in list_a (boolean - True / False)?

Using Python sets, we can easily find exact matches between two Python lists:

In [69]:
google_search_console_keywords = ['data science, machine learning', 'ml consulting', 'digital marketing', 'seo', 
                                  'social media', 'data science consultants', 'search engine optimisation']
paid_search_keywords = ['data science consultants', 'digital marketing', 'seo', 
                        'search engine optimisation']

The .intersection() method returns the keywords that appear in both lists:

In [70]:
set(google_search_console_keywords).intersection(set(paid_search_keywords))

{'data science consultants',
 'digital marketing',
 'search engine optimisation',
 'seo'}

The example below shows all of the keywords within google_search_console_keywords that are not within the paid search keywords list:

In [71]:
set(google_search_console_keywords) - set(paid_search_keywords)

{'data science, machine learning', 'ml consulting', 'social media'}

The example below shows the opposite: all of the keywords in the paid search keywords list that aren't in the Google Search Console keywords list. Here the result is an empty set, because every paid search keyword also appears in the Google Search Console list:

In [72]:
set(paid_search_keywords) - set(google_search_console_keywords)

set()
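The two one-sided differences can also be combined in a single step. As a small sketch using the same two lists, the `^` operator (equivalent to the .symmetric_difference() method) returns every keyword that appears in exactly one of the lists:

```python
google_search_console_keywords = ['data science, machine learning', 'ml consulting',
                                  'digital marketing', 'seo', 'social media',
                                  'data science consultants', 'search engine optimisation']
paid_search_keywords = ['data science consultants', 'digital marketing', 'seo',
                        'search engine optimisation']

# Keywords that are in one list or the other, but not in both
only_in_one_list = set(google_search_console_keywords) ^ set(paid_search_keywords)
print(sorted(only_in_one_list))
# ['data science, machine learning', 'ml consulting', 'social media']
```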

We can also combine both lists into a single set of unique keywords with .union():

In [73]:
set(google_search_console_keywords).union(paid_search_keywords)

{'data science consultants',
 'data science, machine learning',
 'digital marketing',
 'ml consulting',
 'search engine optimisation',
 'seo',
 'social media'}

The commands below check whether all elements of the google_search_console_keywords list are within the paid_search_keywords list, and vice versa:

In [74]:
all_gsc_keywords_in_paid_search = set(google_search_console_keywords).issubset(set(paid_search_keywords))
all_paid_search_keywords_in_gsc_list = set(paid_search_keywords).issubset(set(google_search_console_keywords))

In [75]:
print(all_gsc_keywords_in_paid_search, all_paid_search_keywords_in_gsc_list)

False True
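The six comparisons listed at the start of this section can be bundled into one reusable helper. This is only a sketch: the function name `compare_keyword_lists` and the dictionary keys are invented for illustration, and the operators `&`, `-`, `|` and `<=` are the shorthand equivalents of .intersection(), difference, .union() and .issubset().

```python
def compare_keyword_lists(list_a, list_b):
    """Return the exact-match comparisons between two keyword lists."""
    set_a, set_b = set(list_a), set(list_b)
    return {
        'in_both': set_a & set_b,            # items that occur within both lists
        'only_in_a': set_a - set_b,          # in list_a but not in list_b
        'only_in_b': set_b - set_a,          # in list_b but not in list_a
        'all_keywords': set_a | set_b,       # all of the items
        'a_is_subset_of_b': set_a <= set_b,  # are all items of list_a in list_b?
        'b_is_subset_of_a': set_b <= set_a,  # are all items of list_b in list_a?
    }

google_search_console_keywords = ['data science, machine learning', 'ml consulting',
                                  'digital marketing', 'seo', 'social media',
                                  'data science consultants', 'search engine optimisation']
paid_search_keywords = ['data science consultants', 'digital marketing', 'seo',
                        'search engine optimisation']

comparison = compare_keyword_lists(google_search_console_keywords, paid_search_keywords)
print(comparison['a_is_subset_of_b'], comparison['b_is_subset_of_a'])
# False True
```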


------------------------------------------------------------------------------------------------------------

### De-duplicating Keywords With [FuzzyWuzzy](https://github.com/seatgeek/fuzzywuzzy) 

So what's [fuzzywuzzy?](https://github.com/seatgeek/fuzzywuzzy) It's a string-matching Python package that allows us to easily calculate how similar two strings are via [Levenshtein Distance.](https://en.wikipedia.org/wiki/Levenshtein_distance)
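To make the idea of Levenshtein distance concrete, here is a minimal pure-Python sketch of the textbook dynamic-programming algorithm. It counts the fewest single-character insertions, deletions and substitutions needed to turn one string into another. It is meant as an illustration of the concept only, not as a reproduction of how fuzzywuzzy computes its 0 - 100 scores internally; the function name `levenshtein_distance` is our own.

```python
def levenshtein_distance(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # previous_row[j] = distance between the processed prefix of a and b[:j]
    previous_row = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current_row = [i]  # turning a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            insert_cost = current_row[j - 1] + 1
            delete_cost = previous_row[j] + 1
            substitute_cost = previous_row[j - 1] + (char_a != char_b)
            current_row.append(min(insert_cost, delete_cost, substitute_cost))
        previous_row = current_row
    return previous_row[-1]

print(levenshtein_distance('seo', 'seo'))                              # 0
print(levenshtein_distance('kitten', 'sitting'))                       # 3
print(levenshtein_distance('social media', 'social media marketing'))  # 10
```

The smaller the distance relative to the length of the strings, the more similar the two keywords are.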

Firstly, we'll need to install the Python package called fuzzywuzzy with:

~~~
pip install fuzzywuzzy
~~~

---

As a side note, any time you install Python packages you may need to restart the Python kernel to use them within a Jupyter Notebook <strong>(click Kernel at the top, then click Restart & Clear Output).</strong>

---

One of the most useful functions in the [fuzzywuzzy package](https://github.com/seatgeek/fuzzywuzzy) is process.dedupe(), so let's see it in action!


In [8]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [21]:
print(process.dedupe(keyword_list_example, threshold=70))

dict_keys(['seo copywriting', 'digital marketing services', 'social media marketing'])


What's important to notice here is that the longer phrases have been chosen as the de-duplicated keywords. Whilst this approach provides us with a de-duplicated set of keywords, it comes at the cost of losing track of which keywords were removed during the de-duplication.

To get the de-duplicated keywords as a regular Python list, we can take the returned dict_keys object and simply transform it into a list.

In [25]:
de_duplicated_keywords = list(process.dedupe(keyword_list_example, threshold=70))
print(de_duplicated_keywords, type(de_duplicated_keywords))

['seo copywriting', 'digital marketing services', 'social media marketing'] <class 'list'>
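If we want to know which keywords were dropped, a set difference between the original list and the de-duplicated list recovers them. In the sketch below the de-duplicated list is copied by hand from the output above so that the snippet runs without FuzzyWuzzy installed; the variable names `original_keywords` and `removed_keywords` are illustrative.

```python
original_keywords = ['digital marketing', 'digital marketing', 'digital marketing services',
                     'copywriting', 'seo copywriting', 'social media marketing',
                     'social media', 'digital marketing services']

# Copied from the process.dedupe(..., threshold=70) output shown above
de_duplicated_keywords = ['seo copywriting', 'digital marketing services',
                          'social media marketing']

# Everything in the original list that did not survive the de-duplication
removed_keywords = sorted(set(original_keywords) - set(de_duplicated_keywords))
print(removed_keywords)
# ['copywriting', 'digital marketing', 'social media']
```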


------------------------------------------------------

It's also important to notice that there is a threshold argument which we can pass into process.dedupe:

~~~
process.dedupe(keyword_list, threshold=70)
~~~

This number can range from 0 to 100. The intuition behind it is that two keywords are only treated as duplicates when their similarity score clears the threshold, so as the threshold increases we only de-duplicate keywords that are more closely related to each other. A low threshold therefore gives us more de-duplication, but it may come at the cost of merging keywords that are not really duplicates.
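To see how a threshold turns a similarity score into a duplicate / not-duplicate decision, here is a rough sketch. It uses difflib.SequenceMatcher from the standard library as a stand-in scorer, so the numbers it produces are an assumption for illustration: FuzzyWuzzy's own scorers will generally give different values. The helper name `similarity` is our own.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in scorer: a 0-100 similarity based on difflib, not FuzzyWuzzy."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

pair = ('digital marketing', 'digital marketing services')
score = similarity(*pair)

# The same pair flips from "duplicates" to "distinct" as the threshold rises
for threshold in (50, 70, 90):
    verdict = 'duplicates' if score >= threshold else 'distinct'
    print(f"threshold={threshold}: score={score} -> {verdict}")
```

With this stand-in scorer the pair counts as duplicates at 50 and 70 but not at 90, which mirrors the trade-off described above.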

There is a balance to strike with the threshold parameter, and I encourage you to try different threshold values. So let's do just that! :)

We can use the following code to loop over a range of numbers (stepping up with increments of 10):

~~~
# Start at 10 and increment in steps of 10 up to 90 (100 itself is excluded)
for i in range(10, 100, 10):
    print(i)
~~~

In [27]:
threshold_dictionary = {}

In [39]:
for i in range(10, 100, 10):
    # Stepping the threshold from 10 --> 90, 10 at a time:
    print(f"This is the current threshold value: {i}")
    # Performing the de-duplication
    de_duplicated_items = process.dedupe(keyword_list_example, threshold=i)
    # Storing the de-duplicated items
    threshold_dictionary[i] = list(de_duplicated_items)

This is the current threshold value: 10
This is the current threshold value: 20
This is the current threshold value: 30
This is the current threshold value: 40
This is the current threshold value: 50
This is the current threshold value: 60
This is the current threshold value: 70
This is the current threshold value: 80
This is the current threshold value: 90


In [76]:
print(threshold_dictionary)

{10: ['digital marketing services'], 20: ['digital marketing services'], 30: ['digital marketing services', 'digital marketing'], 40: ['social media marketing', 'digital marketing services', 'seo copywriting'], 50: ['seo copywriting', 'digital marketing services', 'social media marketing'], 60: ['seo copywriting', 'digital marketing services', 'social media marketing'], 70: ['seo copywriting', 'digital marketing services', 'social media marketing'], 80: ['seo copywriting', 'digital marketing services', 'social media marketing'], 90: ['seo copywriting', 'digital marketing services', 'social media marketing']}
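The raw dictionary is hard to scan, so it can help to summarise how many keywords survive at each threshold. The sketch below hard-codes three of the entries printed above (so it runs without FuzzyWuzzy); the names `threshold_results` and `summarise_thresholds` are illustrative.

```python
# Three entries copied from the printed dictionary above
threshold_results = {
    10: ['digital marketing services'],
    30: ['digital marketing services', 'digital marketing'],
    70: ['seo copywriting', 'digital marketing services', 'social media marketing'],
}

def summarise_thresholds(results):
    """Map each threshold to the number of keywords kept at that threshold."""
    return {threshold: len(kept) for threshold, kept in sorted(results.items())}

for threshold, kept_count in summarise_thresholds(threshold_results).items():
    print(f"threshold={threshold}: {kept_count} keyword(s) kept")
# threshold=10: 1 keyword(s) kept
# threshold=30: 2 keyword(s) kept
# threshold=70: 3 keyword(s) kept
```

For this small list, lower thresholds collapse almost everything into a single keyword, while the results stop changing from a threshold of 40 upwards.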


---------------------------------------------------------------

### Grouping De-duplicated Keywords With FuzzyWuzzy

As well as de-duplicating lists of keywords, it would also be useful to keep all of the keywords but bucket them into keyword groups, based on how closely each keyword matches every other keyword.
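One possible way to build such groups is a simple greedy pass: take the keywords from longest to shortest, attach each one to the first existing group whose representative is similar enough, and otherwise start a new group. The sketch below is an assumption-laden illustration rather than the FuzzyWuzzy approach itself: it uses difflib.SequenceMatcher from the standard library as a stand-in scorer, and the names `similarity` and `group_keywords` are our own. With FuzzyWuzzy installed, a scorer such as fuzz.ratio could be swapped in for `similarity`, although the scores and therefore the groups may differ.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in scorer: a 0-100 similarity based on difflib, not FuzzyWuzzy."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def group_keywords(keywords, threshold=70):
    """Greedily bucket keywords under the first representative they resemble."""
    groups = {}
    unique_keywords = list(dict.fromkeys(keywords))  # drop exact duplicates, keep order
    # Longest keywords first, so that they become the group representatives
    for keyword in sorted(unique_keywords, key=len, reverse=True):
        for representative in groups:
            if similarity(keyword, representative) >= threshold:
                groups[representative].append(keyword)
                break
        else:
            # No existing group was close enough, so start a new one
            groups[keyword] = [keyword]
    return groups

keyword_list_example = ['digital marketing', 'digital marketing', 'digital marketing services',
                        'copywriting', 'seo copywriting', 'social media marketing',
                        'social media', 'digital marketing services']

keyword_groups = group_keywords(keyword_list_example, threshold=70)
for representative, members in keyword_groups.items():
    print(representative, '->', members)
# digital marketing services -> ['digital marketing services', 'digital marketing']
# social media marketing -> ['social media marketing', 'social media']
# seo copywriting -> ['seo copywriting', 'copywriting']
```

Unlike plain de-duplication, every original keyword is still present, just filed under a representative.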

------------------------------------------------------------------------------------------------------------