# 1. Keyword De-duplication + 2. Comparing Keyword Lists

In [201]:
import pandas as pd

## Learning Outcomes

- Learn how to de-duplicate keyword lists using set operations
- Learn how to compare two lists of keywords using set operations
- Learn how to de-duplicate keyword lists using FuzzyWuzzy
- Learn how to open an Ahrefs keyword report .CSV file with Pandas
- Perform standard data analysis operations within Pandas on a keyword report from Ahrefs (including GroupBy objects, DataFrame subsetting, the .drop_duplicates() method, the .apply() method and how to save your new dataframe to a .CSV file)
- Learn how to create groups of keywords using FuzzyWuzzy + a custom keyword grouping function
- Learn how to use  <strong> .lower() </strong> to improve de-duplication
- Learn how to use  <strong> NLP lemmatization </strong> to improve de-duplication

--------------------------------------------------------------------------------

### De-duplicating Keywords With Set Operations

In [1]:
keyword_list_example = ['digital marketing', 'digital marketing', 'digital marketing services', 
                       'copywriting', 'seo copywriting', 'social media marketing', 'social media',
                       'digital marketing services']

Set operations allow us to easily de-duplicate a Python list which contains exact duplicates like so:

In [5]:
de_duplicated_set = set(keyword_list_example)

We can also transform the set back into a Python list and re-assign it back to the original variable keyword_list_example.

In [17]:
keyword_list_example = list(de_duplicated_set)

In [19]:
print(keyword_list_example)

['seo copywriting', 'digital marketing services', 'copywriting', 'social media', 'social media marketing', 'digital marketing']
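Notice that the printed list no longer follows the order of the original list, because a set is unordered. If the order matters, one possible sketch is to use `dict.fromkeys`, which keeps the first occurrence of each keyword. The variable name `ordered_unique_keywords` is only illustrative:

```python
keyword_list_example = ['digital marketing', 'digital marketing', 'digital marketing services',
                        'copywriting', 'seo copywriting', 'social media marketing', 'social media',
                        'digital marketing services']

# Dictionary keys are unique and (since Python 3.7) remember insertion order,
# so this removes exact duplicates while keeping the first occurrence of each keyword.
ordered_unique_keywords = list(dict.fromkeys(keyword_list_example))
print(ordered_unique_keywords)
```

Like `set()`, this only removes exact duplicates.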


The benefit of this type of de-duplication is that it is natively supported within Python; however, it only removes exact duplicates and doesn't allow us to capture partial matches. That's where <strong> FuzzyWuzzy comes in! </strong> 

--------------------------------------------------

### How to Compare Two Keyword Lists With Set Operations

We might also want to compare two keyword lists to find where they overlap. For example, if we were to take Google Search Console data and paid search data, we would like to find the following:
    

- Items that occur within both lists
- Items that occur in list_a but not in list_b
- Items that occur in list_b but not in list_a
- All of the items
- Are all of the items in list_a in list_b (boolean - True / False)?
- Are all of the items in list_b in list_a (boolean - True / False)?

Using the power of python sets we can easily find exact matches between two python lists:

In [69]:
google_search_console_keywords = ['data science, machine learning', 'ml consulting', 'digital marketing', 'seo', 
                                  'social media', 'data science consultants', 'search engine optimisation']
paid_search_keywords = ['data science consultants', 'digital marketing', 'seo', 
                        'search engine optimisation']

The .intersection() functionality allows us to find keywords that appear in both lists:

In [70]:
set(google_search_console_keywords).intersection(set(paid_search_keywords))

{'data science consultants',
 'digital marketing',
 'search engine optimisation',
 'seo'}

The example below shows all of the keywords within google_search_console_keywords that are not within the paid search keywords list:

In [71]:
set(google_search_console_keywords) - set(paid_search_keywords)

{'data science, machine learning', 'ml consulting', 'social media'}

The example below shows the exact opposite, all of the keywords that appear within the paid search keywords list that aren't in the google search console keywords list:


In [72]:
set(paid_search_keywords) - set(google_search_console_keywords)

set()

We can also easily combine both lists into a single set of unique keywords with the following set operation:

In [73]:
set(google_search_console_keywords).union(paid_search_keywords)

{'data science consultants',
 'data science, machine learning',
 'digital marketing',
 'ml consulting',
 'search engine optimisation',
 'seo',
 'social media'}

The commands below check whether all elements within the google_search_console_keywords list are within the paid_search_keywords list, and vice versa!

In [74]:
all_gsc_keywords_in_paid_search = set(google_search_console_keywords).issubset(set(paid_search_keywords))
all_paid_search_keywords_in_gsc_list = set(paid_search_keywords).issubset(set(google_search_console_keywords))

In [75]:
print(all_gsc_keywords_in_paid_search, all_paid_search_keywords_in_gsc_list)

False True
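If you run these comparisons often, the six questions listed earlier could be wrapped into one small helper. This is only a sketch: the function name `compare_keyword_lists` and the dictionary keys are made up for illustration, and the operators `&`, `-`, `|` and `<=` are the operator forms of `.intersection()`, set difference, `.union()` and `.issubset()`:

```python
def compare_keyword_lists(list_a, list_b):
    """Illustrative helper: summarise the exact-match overlap between two keyword lists."""
    set_a, set_b = set(list_a), set(list_b)
    return {
        'in_both': set_a & set_b,             # same as set_a.intersection(set_b)
        'only_in_a': set_a - set_b,
        'only_in_b': set_b - set_a,
        'in_either': set_a | set_b,           # same as set_a.union(set_b)
        'a_is_subset_of_b': set_a <= set_b,   # same as set_a.issubset(set_b)
        'b_is_subset_of_a': set_b <= set_a,
    }


google_search_console_keywords = ['data science, machine learning', 'ml consulting', 'digital marketing', 'seo',
                                  'social media', 'data science consultants', 'search engine optimisation']
paid_search_keywords = ['data science consultants', 'digital marketing', 'seo',
                        'search engine optimisation']

comparison = compare_keyword_lists(google_search_console_keywords, paid_search_keywords)
for name, value in comparison.items():
    print(name, value)
```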


------------------------------------------------------------------------------------------------------------

### De-duplicating Keywords With [FuzzyWuzzy](https://github.com/seatgeek/fuzzywuzzy) 

So what's [fuzzywuzzy?](https://github.com/seatgeek/fuzzywuzzy) It's a string matching Python package that allows us to easily score how similar two strings are via [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) (the number of single-character insertions, deletions or substitutions needed to turn one string into the other).
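To build some intuition for what a 0 to 100 similarity score looks like, here is a rough sketch using `difflib.SequenceMatcher` from the Python standard library. This is a stand-in for illustration only, not FuzzyWuzzy's own API, and the scores that FuzzyWuzzy returns may differ. The helper name `similarity_score` is hypothetical:

```python
from difflib import SequenceMatcher


def similarity_score(first, second):
    """Hypothetical helper: similarity between two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, first, second).ratio())


identical = similarity_score('digital marketing', 'digital marketing')
partial = similarity_score('digital marketing', 'digital marketing services')
unrelated = similarity_score('digital marketing', 'copywriting')

# Identical strings score 100, and the score drops as the strings diverge.
print(identical, partial, unrelated)
```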

Firstly we'll need to install the python package called fuzzywuzzy with:

~~~
pip install fuzzywuzzy
~~~

---

As a side-note, any time you install Python packages you may need to restart the Python kernel to use them within a Jupyter Notebook <strong> (click Kernel at the top, then click Restart & Clear Output). </strong>

---

One of the most useful functions from the [fuzzywuzzy package](https://github.com/seatgeek/fuzzywuzzy) is the .dedupe() function, so let's see it in action!


In [161]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [21]:
print(process.dedupe(keyword_list_example, threshold=70))

dict_keys(['seo copywriting', 'digital marketing services', 'social media marketing'])


What's important to notice here is that the longer phrases have been chosen as the de-duplicated keywords. Whilst this approach will provide us with a de-duplicated list, it does come at the cost of losing track of which keywords were removed during the de-duplication.

To get the de-duplicated keywords as a Python list, we can take the returned dict_keys and simply transform it into a list.

In [25]:
de_duplicated_keywords = list(process.dedupe(keyword_list_example, threshold=70))
print(de_duplicated_keywords, type(de_duplicated_keywords))

['seo copywriting', 'digital marketing services', 'social media marketing'] <class 'list'>


------------------------------------------------------

Also it's important to notice that there is a threshold argument which we can pass into process.dedupe:

~~~
process.dedupe(keyword_list, threshold=70)
~~~

This number can range from 0 - 100. The intuition behind it is that as the threshold increases, only keywords that are more closely related to each other are treated as duplicates. Therefore if we set a low threshold we will get more de-duplication, however it might come at the cost of merging keywords that aren't really duplicates. 

There is a balance to using the threshold parameter and I encourage you to try different threshold values. So let's do just that! :) 

We can use the following code to loop over a range of numbers (stepping up with increments of 10):

~~~

for i in range(10, 100, 10):
    print(i)
    # This will start at the number 10 and will increment in steps of 10 up to 90

~~~

In [27]:
threshold_dictionary = {}

In [39]:
for i in range(10, 100, 10):
    # Stepping the threshold from 10 --> 90, 10 at a time:
    print(f"This is the current threshold value: {i}")
    # Performing the de-duplication
    de_duplicated_items = process.dedupe(keyword_list_example, threshold=i)
    # Storing the de-duplicated items
    threshold_dictionary[i] = list(de_duplicated_items)

This is the current threshold value: 10
This is the current threshold value: 20
This is the current threshold value: 30
This is the current threshold value: 40
This is the current threshold value: 50
This is the current threshold value: 60
This is the current threshold value: 70
This is the current threshold value: 80
This is the current threshold value: 90


In [76]:
print(threshold_dictionary)

{10: ['digital marketing services'], 20: ['digital marketing services'], 30: ['digital marketing services', 'digital marketing'], 40: ['social media marketing', 'digital marketing services', 'seo copywriting'], 50: ['seo copywriting', 'digital marketing services', 'social media marketing'], 60: ['seo copywriting', 'digital marketing services', 'social media marketing'], 70: ['seo copywriting', 'digital marketing services', 'social media marketing'], 80: ['seo copywriting', 'digital marketing services', 'social media marketing'], 90: ['seo copywriting', 'digital marketing services', 'social media marketing']}
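The raw dictionary is hard to scan, so one option is to count how many keywords survive at each threshold. The sketch below copies the results printed above into a variable called `threshold_results` (an illustrative name) so that it runs on its own:

```python
# Results copied from the printed output above.
threshold_results = {
    10: ['digital marketing services'],
    20: ['digital marketing services'],
    30: ['digital marketing services', 'digital marketing'],
    40: ['social media marketing', 'digital marketing services', 'seo copywriting'],
    50: ['seo copywriting', 'digital marketing services', 'social media marketing'],
    60: ['seo copywriting', 'digital marketing services', 'social media marketing'],
    70: ['seo copywriting', 'digital marketing services', 'social media marketing'],
    80: ['seo copywriting', 'digital marketing services', 'social media marketing'],
    90: ['seo copywriting', 'digital marketing services', 'social media marketing'],
}

# Count the keywords that are kept at each threshold.
keywords_kept_per_threshold = {
    threshold: len(kept_keywords)
    for threshold, kept_keywords in threshold_results.items()
}

for threshold, count in keywords_kept_per_threshold.items():
    print(f"threshold={threshold}: {count} keyword(s) kept")
```

For this small list, low thresholds collapse almost everything into one keyword, while thresholds from 40 upwards all keep the same three keywords.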


---------------------------------------------------------------

### How To Open An Ahrefs CSV Keyword Report With Pandas

I've prepared a sample report from the Ahrefs keyword explorer for the term "digital marketing". We'll be using this to show you a few ways of doing keyword data analysis + grouping within pandas. Firstly we can load the CSV file with the following syntax:

~~~

df = pd.read_csv('csv_file_path.csv')

~~~

<strong> Important Points: </strong>

- All of the keyword reports from Ahrefs are tab separated; this will need to be specified when we read the CSV file.
- Depending upon how you download the keyword .CSV file from Ahrefs, it will require a different type of encoding. Again this can be specified inside of the pd.read_csv() function.


Examples:

~~~

df = pd.read_csv('data/digital-marketing-keyword-ideas.csv', encoding='UTF-16', delimiter='\t')
df = pd.read_csv('data/digital-marketing-keyword-ideas.csv', encoding='UTF-8', delimiter='\t')

~~~
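If you are not sure which of the two encodings a download uses, one possible approach is to look at the first two bytes of the file. The sketch below assumes that a UTF-16 export starts with a byte-order mark and treats everything else as UTF-8; the helper name `guess_report_encoding` is hypothetical, and other encodings are not handled:

```python
import codecs


def guess_report_encoding(path):
    """Hypothetical helper: 'UTF-16' if the file starts with a UTF-16 byte-order mark, else 'UTF-8'."""
    with open(path, 'rb') as report:
        first_bytes = report.read(2)
    if first_bytes in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return 'UTF-16'
    # Assumption: anything without a UTF-16 byte-order mark is treated as UTF-8.
    return 'UTF-8'


# Possible usage with the report from this tutorial:
# path = 'data/digital-marketing-keyword-ideas.csv'
# df = pd.read_csv(path, encoding=guess_report_encoding(path), delimiter='\t')
```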

In [31]:
df = pd.read_csv('data/digital-marketing-keyword-ideas.csv', encoding='UTF-16', delimiter='\t')

~~~

.info() allows us to inspect and see how many np.nan values are inside of the dataframe

~~~

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
#                 1000 non-null int64
Keyword           1000 non-null object
Country           1000 non-null object
Difficulty        599 non-null float64
Volume            658 non-null float64
CPC               462 non-null float64
Clicks            188 non-null float64
CPS               188 non-null float64
Return Rate       188 non-null float64
Parent Keyword    602 non-null object
Last Update       597 non-null object
SERP Features     562 non-null object
dtypes: float64(6), int64(1), object(5)
memory usage: 93.9+ KB
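.info() reports the non-null count for each column. To get the number of missing values per column directly, one option is to chain `.isna()` and `.sum()`. The sketch below uses a tiny hand-made dataframe (not the Ahrefs report) so that it runs on its own:

```python
import pandas as pd

# A small made-up dataframe with one missing CPC value.
sample_df = pd.DataFrame({
    'Keyword': ['coast', 'hubspot', 'star digital'],
    'Volume': [70000.0, 63000.0, 10.0],
    'CPC': [2.5, 5.0, float('nan')],
})

# .isna() marks every missing cell as True, and .sum() counts the True values per column.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```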


### Column Selection

Dataframes are great, and all of the common operations that you were previously implementing in Google Sheets can be completely automated within Pandas. Let's see how to select single and multiple columns:

~~~

df['single_column'] --> This will return a pd.series object which is essentially a single column.
df[['column_one', 'column_two']] --> This will return a dataframe object, similar to the original df.

~~~


In [80]:
df['Keyword']

0                                coast
1                              hubspot
2                    digital marketing
3                              digital
4                      content meaning
                    ...               
603            results through digital
604    star digital marketing services
610         how to market b2b services
620           phone marketing services
641        campaign marketing services
Name: Keyword, Length: 602, dtype: object

In [82]:
df[['CPC', 'Clicks']]

Unnamed: 0,CPC,Clicks
0,2.5,64715.0
1,5.0,59708.0
2,7.0,11033.0
3,2.5,5912.0
4,17.0,622.0
...,...,...
603,,19.0
604,,
610,,
620,,


### How To Index Specific Columns And Rows

There are two ways that you can index rows and columns: either with .loc or with .iloc:

~~~

.loc[] refers to the column and index names
.iloc refers to the index position of the column and index

~~~

---

Remember when it comes to indexing your dataframe the order is <strong> first ROWS, then COLUMNS </strong>
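One difference that is easy to miss: a .loc slice includes its end label, whereas an .iloc slice excludes its end position, just like a normal Python slice. A minimal sketch on a made-up dataframe:

```python
import pandas as pd

sample_df = pd.DataFrame({
    'Keyword': ['coast', 'hubspot', 'digital marketing', 'digital'],
    'Country': ['gb', 'gb', 'gb', 'gb'],
})

# .loc works with labels: rows labelled 0, 1 and 2 are returned (the end label is included).
by_label = sample_df.loc[0:2, ['Keyword', 'Country']]

# .iloc works with positions: only positions 0 and 1 are returned (the end position is excluded).
by_position = sample_df.iloc[0:2, 0:2]

print(len(by_label), len(by_position))
```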


In [86]:
df.loc[0:5, ['Keyword', 'Country']]

Unnamed: 0,Keyword,Country
0,coast,gb
1,hubspot,gb
2,digital marketing,gb
3,digital,gb
4,content meaning,gb
5,digital media,gb


In [90]:
df.iloc[0:5,0:2]

Unnamed: 0,Keyword,Country
0,coast,gb
1,hubspot,gb
2,digital marketing,gb
3,digital,gb
4,content meaning,gb


------------------------------------------------------------

### Sorting Dataframes By Column Values

Let's rank the keywords by the organic monthly search volume in descending order. Also notice that I've used inplace=True; this means that the dataframe itself is modified and stays sorted by search volume, rather than a sorted copy being returned:

~~~

df.sort_values(by='column_name', ascending=Boolean, inplace=Boolean)
df.head(integer)

- The df.head() command allows us to easily show the top N results.

~~~
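To make the effect of inplace concrete, here is a small sketch on a made-up dataframe. Without inplace=True, .sort_values() returns a sorted copy and leaves the original untouched; with inplace=True it returns None and changes the original:

```python
import pandas as pd

sample_df = pd.DataFrame({
    'Keyword': ['digital', 'coast', 'hubspot'],
    'Volume': [16000.0, 70000.0, 63000.0],
})

# Without inplace: a new, sorted dataframe is returned and sample_df keeps its order.
sorted_copy = sample_df.sort_values(by='Volume', ascending=False)
order_before = sample_df['Keyword'].tolist()

# With inplace=True: nothing is returned (None) and sample_df itself is re-ordered.
result = sample_df.sort_values(by='Volume', ascending=False, inplace=True)
order_after = sample_df['Keyword'].tolist()

print(order_before, result, order_after)
```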

In [27]:
df.sort_values(by='Volume', ascending=False, inplace=True)
df.head(6)

Unnamed: 0,#,Keyword,Country,Difficulty,Volume,CPC,Clicks,CPS,Return Rate,Parent Keyword,Last Update,SERP Features
0,1,coast,gb,42.0,70000.0,2.5,64715.0,0.93,1.67,coast,2020-05-09 23:55:39,Sitelinks
1,2,hubspot,gb,67.0,63000.0,5.0,59708.0,0.95,2.34,hubspot,2020-05-09 00:57:01,"Adwords top, Sitelinks, Top stories, Thumbnail..."
2,3,digital marketing,gb,74.0,19000.0,7.0,11033.0,0.57,1.41,digital marketing,2020-05-09 07:11:09,"Adwords top, Sitelinks, People also ask, Top s..."
3,4,digital,gb,89.0,16000.0,2.5,5912.0,0.38,1.19,digital,2020-05-09 02:26:10,"Sitelinks, People also ask, Top stories, Thumb..."
4,5,content meaning,gb,45.0,4400.0,17.0,622.0,0.14,1.18,content,2020-05-10 06:32:24,"Knowledge card, People also ask, Top stories, ..."
5,6,digital media,gb,24.0,3600.0,3.0,1671.0,0.47,1.24,digital media,2020-05-10 14:14:41,"Featured snippet, Thumbnails, People also ask,..."


Now let's sort the dataframe based upon CPC:

In [30]:
df.sort_values(by='CPC', ascending=False).head(6)

Unnamed: 0,#,Keyword,Country,Difficulty,Volume,CPC,Clicks,CPS,Return Rate,Parent Keyword,Last Update,SERP Features
525,526,internet marketing companies,gb,64.0,40.0,60.0,,,,digital marketing agency,2020-04-29 00:18:03,Featured snippet
442,443,mobile marketing companies,gb,9.0,50.0,40.0,,,,mobile marketing agency,2020-05-04 12:55:50,Featured snippet
164,165,digital market,gb,83.0,150.0,30.0,78.0,0.58,1.32,digital marketplace,2020-04-24 15:59:15,"Sitelinks, People also ask, Top stories, Thumb..."
162,163,search marketing agency,gb,51.0,150.0,25.0,,,,search marketing agency,2020-04-24 18:47:07,"Adwords top, Sitelinks, Adwords bottom"
56,57,seo services london,gb,44.0,350.0,25.0,320.0,0.96,2.74,seo services london,2020-05-01 03:22:55,"Adwords bottom, Sitelinks"
256,257,top digital marketing companies,gb,39.0,90.0,25.0,,,,digital marketing company,2020-04-16 17:50:50,"Sitelinks, People also ask"


Okay that's great, but as Ahrefs provides a Parent Keyword column, let's firstly remove any keywords that don't have a value for this column:

~~~

.dropna(subset='column_name') This command allows us to drop np.NaN (not a number) values from the dataframe.

~~~

Also let's remove the # column as it is unnecessary:

In [45]:
df.dropna(subset=['Parent Keyword'], inplace=True)
df.drop(columns=['#'], inplace=True)
df.head(6)

Unnamed: 0,Keyword,Country,Difficulty,Volume,CPC,Clicks,CPS,Return Rate,Parent Keyword,Last Update,SERP Features
0,coast,gb,42.0,70000.0,2.5,64715.0,0.93,1.67,coast,2020-05-09 23:55:39,Sitelinks
1,hubspot,gb,67.0,63000.0,5.0,59708.0,0.95,2.34,hubspot,2020-05-09 00:57:01,"Adwords top, Sitelinks, Top stories, Thumbnail..."
2,digital marketing,gb,74.0,19000.0,7.0,11033.0,0.57,1.41,digital marketing,2020-05-09 07:11:09,"Adwords top, Sitelinks, People also ask, Top s..."
3,digital,gb,89.0,16000.0,2.5,5912.0,0.38,1.19,digital,2020-05-09 02:26:10,"Sitelinks, People also ask, Top stories, Thumb..."
4,content meaning,gb,45.0,4400.0,17.0,622.0,0.14,1.18,content,2020-05-10 06:32:24,"Knowledge card, People also ask, Top stories, ..."
5,digital media,gb,24.0,3600.0,3.0,1671.0,0.47,1.24,digital media,2020-05-10 14:14:41,"Featured snippet, Thumbnails, People also ask,..."


### Utilising GroupBy Objects:

We can use a Pandas function called .groupby() which will allow us to group keywords based upon their Parent Keyword:
    
~~~

df.groupby('column_name')

~~~

We will also save this groupby object to a variable so that we can reference it directly.

In [48]:
grouped_parent_keywords = df.groupby('Parent Keyword')

In [53]:
grouped_parent_keywords.count()

Unnamed: 0_level_0,Keyword,Country,Difficulty,Volume,CPC,Clicks,CPS,Return Rate,Last Update,SERP Features
Parent Keyword,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
5 ws,2,2,2,2,0,1,1,1,2,2
a,3,3,3,3,0,0,0,0,3,3
absolute digital media,1,1,1,1,1,0,0,0,1,1
absolute pr,1,1,1,1,1,0,0,0,1,1
according,2,2,2,2,1,2,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...
why work in digital marketing,1,1,1,1,1,1,1,1,1,1
work digital,1,1,1,1,1,0,0,0,1,1
world of digital,1,1,1,1,1,0,0,0,1,1
بازاریابی,1,1,1,1,0,0,0,0,1,1


In [60]:
grouped_parent_keywords.mean(numeric_only=True)

Unnamed: 0_level_0,Difficulty,Volume,CPC,Clicks,CPS,Return Rate
Parent Keyword,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
5 ws,24.000000,95.0,,64.0,0.54,1.00
a,41.333333,50.0,,,,
absolute digital media,55.000000,80.0,9.0,,,
absolute pr,0.000000,30.0,0.0,,,
according,7.000000,125.0,0.0,18.0,0.15,1.13
...,...,...,...,...,...,...
why work in digital marketing,0.000000,100.0,0.0,157.0,1.50,1.67
work digital,3.000000,60.0,0.0,,,
world of digital,33.000000,200.0,0.0,,,
بازاریابی,8.000000,60.0,,,,


--------------------------------------------------------------------------------

Notice that as it is grouping a series of keywords by their Parent Keyword, we need to use aggregation to summarise the grouped metrics. Common functions include the following:

~~~

.mean()
.count()
.sum()
.median()

~~~

However for our analysis we'll want to use a custom .agg() function so that we can apply different summarisation techniques to unique columns:

In [66]:
grouped_parent_keywords = grouped_parent_keywords.agg({'Keyword': 'count',
                             'Difficulty': 'mean', 
                             'Volume': 'sum', 
                             'CPC': 'mean',
                             'CPS': 'mean', 
                             'Return Rate': 'mean'})
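With the dictionary form, the summarised columns keep their original names, so the Keyword column ends up holding a count. As an alternative sketch, pandas also supports named aggregation, where each output column is given its own name. The dataframe and the output column names below are made up for illustration:

```python
import pandas as pd

sample_df = pd.DataFrame({
    'Keyword': ['hubspot', 'hub spot', 'digital marketing'],
    'Parent Keyword': ['hubspot', 'hubspot', 'digital marketing'],
    'Volume': [63000.0, 2600.0, 19000.0],
    'Difficulty': [67.0, 53.0, 74.0],
})

# Each keyword argument is new_column_name=(source_column, aggregation).
summary = sample_df.groupby('Parent Keyword').agg(
    keyword_count=('Keyword', 'count'),
    total_volume=('Volume', 'sum'),
    mean_difficulty=('Difficulty', 'mean'),
)
print(summary)
```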

## We Do Section:

Now we can use similar sorting techniques to the ones that we learned earlier. Code along and sort the grouped dataframe by: 

- Difficulty
- Volume

In [71]:
grouped_parent_keywords.sort_values(by='Volume', ascending=False, inplace=True)
grouped_parent_keywords.sort_values(by='Difficulty', ascending=False)

Unnamed: 0_level_0,Keyword,Difficulty,Volume,CPC,CPS,Return Rate
Parent Keyword,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
whois,3,94.666667,1300.0,0.475,0.666667,1.156667
is it down,1,92.000000,90.0,0.000,,
the digital picture,1,86.000000,40.0,0.000,,
what,2,84.500000,170.0,,,
the drum,1,84.000000,90.0,0.000,,
...,...,...,...,...,...,...
digital sales,1,0.000000,60.0,0.000,0.370000,1.120000
tps,1,,10.0,,,
b2b marketing strategy,1,,10.0,,,
star digital,1,,10.0,,,


----------------------------------------------------------------------------------------------------

### Removing exact duplicates

Let's now return to our original dataframe and practice some other useful methods. Firstly we can attempt to drop any duplicates within our keyword column:

In [77]:
df.drop_duplicates(subset=['Keyword'], inplace=True)

However as there are no duplicates inside of the keywords column, let's get a de-duplicated list of parent keywords by using the following command:

~~~

.drop_duplicates(subset=['column_name'])

~~~

Then we will select the Parent Keyword column, and convert it into a python list with:

~~~

.tolist()

~~~


In [108]:
parent_keywords_df = df.drop_duplicates(subset=['Parent Keyword'])
parent_keywords_list = parent_keywords_df['Parent Keyword'].to_list()[0:10]
print(f"This is the first 10 items of a de-duplicated python list from the parent column: {parent_keywords_list}")

This is the first 10 items of a de-duplicated python list from the parent column: ['coast', 'hubspot', 'digital marketing', 'digital', 'content', 'digital media', 'digital marketing agency', 'digital uk', 'and digital', 'online marketing']


### DataFrame Filtering (subsetting)

The easiest type of filtering is to use the == operator. 

For example, we might want to find keywords that are equal to hubspot within our dataframe:

In [226]:
single_keyword_mask = df['Keyword'] == 'hubspot'
single_keyword_df = df[single_keyword_mask]
single_keyword_df

Unnamed: 0,Keyword,Country,Difficulty,Volume,CPC,Clicks,CPS,Return Rate,Parent Keyword,Last Update,SERP Features,Stemmed Keyword
1,hubspot,gb,67.0,63000.0,5.0,59708.0,0.95,2.34,hubspot,2020-05-09 00:57:01,"Adwords top, Sitelinks, Top stories, Thumbnail...",hubspot


We could also write the same filter like so:
    
~~~

single_keyword_df = df[df['Keyword'] == 'hubspot']

~~~

----------------------------------------

Additionally, we might want to filter our dataframe to only find keywords that have a search volume greater than 50, so let's do just that:

~~~

df['column_name'] > 50

~~~


In [115]:
mask = df['Volume'] > 50
filtered_dataframe = df[mask]

Another shorthand way to accomplish the same operation would be:

~~~

filtered_dataframe  = df[df['Volume'] > 50]

~~~


----------------------------------------------------------------------

We can also filter the dataframe by several columns by chaining the boolean dataframe subsets together:

In [119]:
mask_two = (df['Volume'] > 50) & (df['CPC'] > 2.0)
two_column_filtered_dataframe = df[mask_two]

This could also be written like this:

~~~

two_column_filtered_dataframe = df[(df['Volume'] > 50) & (df['CPC'] > 2.0)]

~~~

--------------------------------------------------

We can also do OR operations with the pipe operator <strong> | </strong>

In [125]:
mask_three = (df['Volume'] > 50) | (df['CPC'] > 2.0)
or_dataframe_subset= df[mask_three]

Which could also be written like this:
    
~~~

or_dataframe_subset = df[(df['Volume'] > 50) | (df['CPC'] > 2.0)]

~~~
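A mask can also be inverted with the `~` operator (NOT). One thing to watch out for, shown in this sketch on a made-up dataframe: comparing a missing value with `>` gives False, so rows with a missing Volume are kept when the mask is inverted:

```python
import pandas as pd

sample_df = pd.DataFrame({
    'Keyword': ['coast', 'star digital', 'tps'],
    'Volume': [70000.0, 10.0, float('nan')],
})

high_volume_mask = sample_df['Volume'] > 50

# ~ flips every True to False and vice versa, so this keeps the rows where the mask is False.
not_high_volume_df = sample_df[~high_volume_mask]
print(not_high_volume_df['Keyword'].tolist())
```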

------------------------------------------------------------------------------------------------------------------------

### Grouping de_duplicated keywords with FuzzyWuzzy

As well as de-duplicating lists of keywords, it would also be useful to keep all of the keywords but bucket them into keyword groups based upon how closely each keyword matches every other keyword. 

Without going into the specifics of how this function works, you can use it as a way to group keywords based upon their FuzzyWuzzy score:

In [163]:
def word_grouper(df, column_name=None, limit=6, threshold=85):
    # Candidate pool: the unique values of the column, in their original order
    candidates = df.drop_duplicates(subset=column_name)[column_name].tolist()
    master_dict = {}
    processed_words = []
    no_matches = []

    for keyword in df[column_name]:
        # Pop the first candidate so that a keyword is never matched against itself
        try:
            candidates.pop(0)
        except IndexError as e:
            print(e)

        # Let's only loop over the keywords that aren't already grouped
        if keyword not in processed_words:
            try:
                # Creating the top N matches as (candidate, score) tuples
                matches = process.extract(keyword, candidates, limit=limit)
                # Keeping only the matches above the threshold, so that we never
                # cluster words together with a low match score
                matches = [match for match in matches if match[1] > threshold]
                matched_words = [match[0] for match in matches]
                # Saving the matches to a dictionary
                master_dict[keyword] = matches
                # Saving the matched words to the list of processed words
                processed_words.extend(matched_words)
            except Exception:
                no_matches.append(keyword)
    return master_dict

In [188]:
grouped_words = word_grouper(df, column_name='Keyword', limit=6, threshold=70)

In [189]:
grouped_words_df = pd.DataFrame.from_dict(grouped_words, orient='index')
grouped_words_df.iloc[0:20,:]
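The wide layout above puts each match in its own column. If one row per (group, matched keyword) pair is easier to work with, the dictionary could be flattened as in the sketch below. The `example_groups` dictionary is hypothetical: it only mimics the shape that word_grouper returns, and its scores are made up:

```python
import pandas as pd

# Hypothetical data in the same shape as the word_grouper output:
# {keyword: [(matched_keyword, score), ...]}
example_groups = {
    'digital marketing': [('digital marketing services', 90), ('digital marketing agency', 86)],
    'seo copywriting': [('copywriting', 90)],
    'hubspot': [],
}

# One row per match; keywords without any matches produce no rows.
rows = [
    {'Group': group, 'Matched Keyword': match, 'Score': score}
    for group, matches in example_groups.items()
    for match, score in matches
]
groups_long_df = pd.DataFrame(rows, columns=['Group', 'Matched Keyword', 'Score'])
print(groups_long_df)
```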

--------------------------------------------------------------------------------------------------------------------

### Additional Ways That We Can Improve Our De-Duplication Efforts

#### 1. Using .lower()

.lower() is a built-in string method. Applying it to every string in a list ensures that duplicates which differ only by case, such as "digital marketing services" vs "Digital Marketing Services", are normalised to the same value.

In [192]:
example = ['DIGITAL MARKETING SERVICES', 'digital marketing services', 'Digital Marketing Services']

In [195]:
example = [word.lower() for word in example]
print(example)

['digital marketing services', 'digital marketing services', 'digital marketing services']


In [200]:
de_duplicated_example = list(set(example))
print(f"This is the de_duplicated_example {de_duplicated_example}")

This is the de_duplicated_example ['digital marketing services']
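Case is not the only source of near-identical keywords; stray or repeated spaces cause the same problem. A possible extension, sketched below with a hypothetical helper called `normalise_keyword`, lower-cases each keyword and also collapses its whitespace before de-duplicating:

```python
def normalise_keyword(keyword):
    """Illustrative helper: lower-case, trim and collapse repeated whitespace."""
    # .split() with no arguments splits on any run of whitespace and drops leading/trailing spaces.
    return ' '.join(keyword.lower().split())


raw_keywords = ['DIGITAL MARKETING SERVICES', ' digital  marketing services ', 'Digital Marketing Services']
normalised_unique = sorted({normalise_keyword(keyword) for keyword in raw_keywords})
print(normalised_unique)
```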


------------------------------------------------------------------------

#### 2. Stemming + Lemmatisation

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. Stemming is the cruder cousin: it chops off word endings with simple rules, so the result (for example "digit" for "digital") is not always a real word.

Luckily I've already got some stemming functions that we can utilise on our previous dataframe.

Firstly we'll need to install the python package called NLTK with:

~~~
pip install nltk
~~~

In [203]:
from nltk.util import ngrams
from nltk.stem import PorterStemmer, LancasterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

In [205]:
porter=PorterStemmer()
lancaster=LancasterStemmer()

# NLP Processing
def stem_sentence(sentence):
    # Split the sentence into individual word tokens
    token_words = word_tokenize(sentence)
    # Stem every token and re-join them with single spaces (no trailing space)
    return " ".join(porter.stem(word) for word in token_words)

#rejoin words after stemming
def rejoin_words(row):
    joined_words = ( " ".join(row))
    return joined_words

#prepare list of all words in criteria report
def prepare_list(df, column):
    results = []
    for t in df[column]:
        x=t.split()
        for i in x:
            results.append(i)
    # Remove Duplicates:           
    return list(set(results))

In [227]:
def test(x):
    return x

Now we will use an .apply() method on the dataframe. When called on a single column such as df['Keyword'], it runs a function on every value in that column, one at a time (when called on a whole dataframe, it works on every column or every row instead). 

For example, let's do a very simple method using the above function (test). This will simply return every row:

~~~

df['Keyword'].apply(test)

~~~

In [228]:
df['Keyword'].apply(test)

0                                coast
1                              hubspot
2                    digital marketing
3                              digital
4                      content meaning
                    ...               
603            results through digital
604    star digital marketing services
610         how to market b2b services
620           phone marketing services
641        campaign marketing services
Name: Keyword, Length: 602, dtype: object
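The same pattern works with any function that takes a single value and returns a single value. As a small sketch on a made-up dataframe, the hypothetical function `count_words` adds a word count for each keyword:

```python
import pandas as pd

sample_df = pd.DataFrame({'Keyword': ['coast', 'digital marketing', 'how to market b2b services']})


def count_words(keyword):
    # Called once per value in the column
    return len(keyword.split())


sample_df['Word Count'] = sample_df['Keyword'].apply(count_words)
print(sample_df)
```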

Now let's apply some stemming to our keyword column with an .apply() method: 

- Then we will save it as a new column by re-assigning it to the original dataframe with a new name.

In [229]:
df['Stemmed Keyword'] = df['Keyword'].apply(stem_sentence)

In [230]:
df.head(6)

Unnamed: 0,Keyword,Country,Difficulty,Volume,CPC,Clicks,CPS,Return Rate,Parent Keyword,Last Update,SERP Features,Stemmed Keyword
0,coast,gb,42.0,70000.0,2.5,64715.0,0.93,1.67,coast,2020-05-09 23:55:39,Sitelinks,coast
1,hubspot,gb,67.0,63000.0,5.0,59708.0,0.95,2.34,hubspot,2020-05-09 00:57:01,"Adwords top, Sitelinks, Top stories, Thumbnail...",hubspot
2,digital marketing,gb,74.0,19000.0,7.0,11033.0,0.57,1.41,digital marketing,2020-05-09 07:11:09,"Adwords top, Sitelinks, People also ask, Top s...",digit market
3,digital,gb,89.0,16000.0,2.5,5912.0,0.38,1.19,digital,2020-05-09 02:26:10,"Sitelinks, People also ask, Top stories, Thumb...",digit
4,content meaning,gb,45.0,4400.0,17.0,622.0,0.14,1.18,content,2020-05-10 06:32:24,"Knowledge card, People also ask, Top stories, ...",content mean
5,digital media,gb,24.0,3600.0,3.0,1671.0,0.47,1.24,digital media,2020-05-10 14:14:41,"Featured snippet, Thumbnails, People also ask,...",digit media


We could then use these stemmed/lemmatized keywords instead of the original keywords whilst performing de-duplication with FuzzyWuzzy!

In [233]:
stemmed_keywords = df['Stemmed Keyword'].tolist()
de_duplicated_stemmed_keywords = list(set(stemmed_keywords))

In [238]:
stemmed_unique_keywords = list(process.dedupe(de_duplicated_stemmed_keywords, threshold=80))

--------------------------------------------------------------------------------

The code below shows how you could filter the dataframe down to the rows whose stemmed keyword is in the de-duplicated list, using the .isin() function:

~~~

df[df['Stemmed Keyword'].isin(python_list)]

~~~

In [242]:
unique_stemmed_df = df[df['Stemmed Keyword'].isin(stemmed_unique_keywords)]

Also you will notice in the output below that the rows still carry their original index labels after filtering:

We can reset the index with:

~~~

unique_stemmed_df.reset_index()

~~~
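By default .reset_index() keeps the old index as a new column called index. Passing drop=True discards it instead. A minimal sketch on a made-up dataframe:

```python
import pandas as pd

# The index labels mimic the gaps left behind after filtering.
sample_df = pd.DataFrame({'Keyword': ['hubspot', 'content meaning', 'hub spot']}, index=[1, 4, 10])

kept_old_index = sample_df.reset_index()              # old index becomes a column named 'index'
dropped_old_index = sample_df.reset_index(drop=True)  # old index is thrown away

print(kept_old_index.columns.tolist(), dropped_old_index.columns.tolist())
```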

In [244]:
unique_stemmed_df.head(6)

Unnamed: 0,Keyword,Country,Difficulty,Volume,CPC,Clicks,CPS,Return Rate,Parent Keyword,Last Update,SERP Features,Stemmed Keyword
1,hubspot,gb,67.0,63000.0,5.0,59708.0,0.95,2.34,hubspot,2020-05-09 00:57:01,"Adwords top, Sitelinks, Top stories, Thumbnail...",hubspot
4,content meaning,gb,45.0,4400.0,17.0,622.0,0.14,1.18,content,2020-05-10 06:32:24,"Knowledge card, People also ask, Top stories, ...",content mean
10,hub spot,gb,53.0,2600.0,7.0,1820.0,0.71,1.37,hubspot,2020-05-05 05:59:21,"Adwords top, Sitelinks, Top stories, Thumbnail...",hub spot
15,climb online,gb,3.0,1300.0,0.0,1228.0,0.95,1.2,climb online,2020-05-04 21:52:16,"Sitelinks, Top stories, Thumbnails",climb onlin
18,company meaning,gb,55.0,1200.0,0.0,158.0,0.13,1.11,مؤسسة,2020-05-08 16:52:22,"Knowledge card, People also ask, Sitelinks, To...",compani mean
20,digital marketing strategy,gb,65.0,1200.0,9.0,1112.0,0.96,1.36,digital marketing strategy,2020-05-08 20:21:50,"Featured snippet, Thumbnails, People also ask,...",digit market strategi


In [248]:
unique_stemmed_df = unique_stemmed_df.reset_index(drop=True)
unique_stemmed_df.head(6)

Unnamed: 0,Keyword,Country,Difficulty,Volume,CPC,Clicks,CPS,Return Rate,Parent Keyword,Last Update,SERP Features,Stemmed Keyword
0,hubspot,gb,67.0,63000.0,5.0,59708.0,0.95,2.34,hubspot,2020-05-09 00:57:01,"Adwords top, Sitelinks, Top stories, Thumbnail...",hubspot
1,content meaning,gb,45.0,4400.0,17.0,622.0,0.14,1.18,content,2020-05-10 06:32:24,"Knowledge card, People also ask, Top stories, ...",content mean
2,hub spot,gb,53.0,2600.0,7.0,1820.0,0.71,1.37,hubspot,2020-05-05 05:59:21,"Adwords top, Sitelinks, Top stories, Thumbnail...",hub spot
3,climb online,gb,3.0,1300.0,0.0,1228.0,0.95,1.2,climb online,2020-05-04 21:52:16,"Sitelinks, Top stories, Thumbnails",climb onlin
4,company meaning,gb,55.0,1200.0,0.0,158.0,0.13,1.11,مؤسسة,2020-05-08 16:52:22,"Knowledge card, People also ask, Sitelinks, To...",compani mean
5,digital marketing strategy,gb,65.0,1200.0,9.0,1112.0,0.96,1.36,digital marketing strategy,2020-05-08 20:21:50,"Featured snippet, Thumbnails, People also ask,...",digit market strategi


### Saving Our Final Dataframe To CSV

After you've finished any data analysis, you can easily save the pandas series object or pandas dataframe to CSV with:

~~~

df.to_csv('name_of_file.csv', index=True)

~~~

In [250]:
unique_stemmed_df.to_csv('stemmed_dataframe.csv', index=True)
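The index argument controls whether the row index is written as an extra first column. The sketch below uses a made-up dataframe and leaves out the file name, in which case .to_csv() returns the CSV text as a string, so the two header rows can be compared without writing any files:

```python
import pandas as pd

sample_df = pd.DataFrame({'Keyword': ['hubspot', 'hub spot'], 'Volume': [63000.0, 2600.0]})

# With no file path, .to_csv() returns the CSV content as a string.
with_index = sample_df.to_csv(index=True)
without_index = sample_df.to_csv(index=False)

header_with_index = with_index.splitlines()[0]
header_without_index = without_index.splitlines()[0]
print(repr(header_with_index), repr(header_without_index))
```

With index=True the header starts with an empty column name, which is where the Unnamed: 0 column comes from when such a file is read back in with pd.read_csv().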

---------------------------------------------------------

Congratulations, you've made it to the end of the first tutorial! 

We've introduced you to the following frameworks:

- Python Sets
- FuzzyWuzzy
- Pandas
- NLP Stemming & Lemmatization

### It's time to continue on our epic journey and to learn more scripts which will help to automate your SEO life! 