### Effective Name Matching in Python (Data Extraction and Cleaning)

### Introduction

When using data, most people agree that your insights and analysis are only as good as the data you are using. Essentially, garbage data in is garbage analysis out. Data cleaning, also referred to as data cleansing and data scrubbing, is one of the most important steps for your organization if you want to create a culture around quality data decision-making.

We often need to match a word against many variations of it, caused by typos, pronunciation errors, nicknames, short forms, and so on. Matching query names against names stored in a database is a typical case: when input text has to be matched to database records, we cannot always expect an exact match. NLP developers will have met this when extracting names as named entities and then matching them despite misspellings and mistranslations.
![1_B7SlxsGryquq8tmVcFJWxA.png](attachment:1_B7SlxsGryquq8tmVcFJWxA.png)

This project uses a surprisingly effective way of matching hotel names scraped from different websites.

#### Problem Statement 
I recently faced a challenge where I had to extract hotel names from data scraped from articles on the websites of different hotel agencies. Each of these websites used a different name to refer to the same hotel, so I then had to match the extracted hotel names to the standard names given in another CSV file.

So I divided the task into two parts:

    i. Extracting the names
    ii. Matching the hotel names

Before doing either, I first had to understand the data I had and look for any patterns in it.
 

#### Import the necessary modules

In [9]:
import pandas as pd
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse import csr_matrix
import sparse_dot_topn.sparse_dot_topn as ct  #Cosine Similarity
import time
#pd.set_option('display.max_colwidth', -1)
import warnings
warnings.filterwarnings('ignore')

In [10]:
#Loading the data sets
scraped= pd.read_csv('Hotel Location Table Mapping.csv',encoding='latin1')
hotels=pd.read_csv('Hotel Location.csv',encoding='latin1')
scraped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8381 entries, 0 to 8380
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   scraped_url   8379 non-null   object 
 1   agency_slug   8379 non-null   object 
 2   country       7992 non-null   object 
 3   title         8379 non-null   object 
 4   description   5270 non-null   object 
 5   inclusions    6884 non-null   object 
 6   hotel_name    4524 non-null   object 
 7   hotel_mapped  0 non-null      float64
dtypes: float64(1), object(7)
memory usage: 523.9+ KB


The scraped data frame contains the data scraped from the different agencies' websites. It has eight columns and 8381 entries. The goal is to fill the hotel_mapped field, which contains only null values, with a hotel name mapped from the hotels dataframe. The name should first be extracted from the title, description or inclusions column and then mapped to the right hotel name in the hotels table.

We have also been provided with a *hotel_name* column, which already contains extracted hotel names but has only 4524 non-null values, leaving 3857 null values.

In [8]:
scraped.sample(2)


Unnamed: 0,scraped_url,agency_slug,country,title,description,inclusions,hotel_name,hotel_mapped
8333,https://packages.travelstart.co.za/Westin-Turt...,travelstart,mauritius,Westin Turtle Bay Resort,"<p style=""font-size: 16px;"">Revel in the extra...","<p style=""font-size: 16px;"">Return flights fro...",the westin turtle bay resort & spa mauritius,
4350,https://computravel.co.za/packages/11084/mauri...,computravel,mauritius,Mauritius - 5* Maritim Resort - All Inclusive ...,"<div class=""item-description-text"">\n\n<h6>Qui...",,,


The hotels table contains the correct names and locations of the hotels, which should later be matched with the extracted hotel names.

In [4]:
hotels.sample(2)

Unnamed: 0,Name,Address1,Address2,City,StateProvince,PostalCode,Country,Latitude,Longitude,AirportCode,PropertyCurrency,StarRating,Location
25384,Clay Corner Inn,401 Clay Street SW,,Blacksburg,VA,24060,US,37.22504,-80.41498,ROA,USD,3.0,Near Cassell Coliseum
118298,Santosa City Hotel,Jl. Patih Jelantik No.8,,Denpasar,,80361,ID,-8.66846,115.21694,DPS,IDR,3.0,Near Bali Museum


The table also contains the country code and addresses, which we could use for more precise matching.

### Name Extraction 

The first step was to select only the rows with null values in the hotel_name column, so that I had fewer rows to deal with.

In [11]:
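# .copy() gives independent frames, so adding columns to them later
# does not rely on pandas' chained-assignment behaviour
nonselected_rows = scraped[scraped['hotel_name'].notnull()].copy()
selected_rows = scraped[scraped['hotel_name'].isnull()].copy()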

In [7]:
display(selected_rows.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3857 entries, 0 to 8380
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   scraped_url   3855 non-null   object 
 1   agency_slug   3855 non-null   object 
 2   country       3557 non-null   object 
 3   title         3855 non-null   object 
 4   description   2193 non-null   object 
 5   inclusions    2486 non-null   object 
 6   hotel_name    0 non-null      object 
 7   hotel_mapped  0 non-null      float64
dtypes: float64(1), object(7)
memory usage: 271.2+ KB


None

We then create a new column, hotels_new, to hold the newly extracted names.

In [12]:
nonselected_rows['hotels_new'] = nonselected_rows['hotel_name']

**Data cleaning**

In [14]:
def clean_text(text):
    """
    Clean a raw text field:
        - replace links with the placeholder 'url-web',
        - remove HTML tags and ampersands,
        - replace escape sequences with spaces,
        - convert the text to lowercase,
        - collapse any extra white space.

    Input:
    text: original text
          datatype: string

    Output:
    text: modified text
          datatype: string
    """
    # replace links with url-web (raw string so the backslashes reach the regex engine)
    pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
    subs_url = 'url-web'
    text = re.sub(pattern_url, subs_url, text)
    # replace HTML tags and ampersands with " "
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub('&', ' ', text)

    # drop the literal token 'Nan'
    text = re.sub('Nan', ' ', text)
    # replace escape sequences, backticks and curly quotes with a space
    escape_seq = ["\r", "\n", "\a", "\b", "\f", "`", "”", "\t"]
    for ch in escape_seq:
        text = text.replace(ch, " ")
    # remove capitalization
    text = text.lower()
    # split and join the words to collapse repeated white space
    text = ' '.join(text.split())

    return text

In [15]:
def clean(text):
    text = str(text)
    # removing paragraph numbers such as "12." followed by a tab
    text = re.sub(r'[0-9]+\.\t', '', text)
    # removing new line characters
    text = re.sub('\n ', '', text)
    text = re.sub('\n', ' ', text)
    # removing possessive 's
    text = re.sub("'s", '', text)
    # removing hash and at signs
    text = re.sub("#", ' ', text)
    text = re.sub("@", '', text)

    # removing quotation marks
    text = re.sub('"', '', text)
    # dropping the period after salutations
    text = re.sub(r"Mr\.", 'Mr', text)
    text = re.sub(r"Mrs\.", 'Mrs', text)
    # removing any bracketed reference to outside text
    text = re.sub(r"[\(\[].*?[\)\]]", "", text)

    return text

In [16]:
def remove_punctuation(post):
    return ''.join([l for l in post if l not in string.punctuation])

In [17]:
#Applying the cleaning functions to the description column
scraped['description'] = scraped['description'].astype(str).apply(clean_text)
scraped['description'] = scraped['description'].astype(str).apply(remove_punctuation)

**Next, I grouped the data frame by domain name and looked at the patterns in each website. This makes the name extraction task easier.**

In [18]:
#function to extract the domain name from the URL
def domain_name(url):
    return url.split("www.")[-1].split("//")[-1].split(".")[0]

In [19]:
selected_rows['domain'] =selected_rows['scraped_url'].astype(str).apply(domain_name)

In [20]:
#Create a list of unique domain names in the domain name column
type_labels = list(selected_rows.domain.unique())

We have 72 unique domain names, which makes it tough to inspect the HTML patterns of every website, but this is still the easiest and most computationally efficient approach. I had earlier tried the **substring** method and the **fuzzywuzzy module**, both of which gave inaccurate results and took too long to run.
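To make the grouping step concrete, here is a small sketch that applies the same split logic as `domain_name` to three URL prefixes taken from the samples above. The paths after the host are replaced by a placeholder `x` because the originals are truncated, and `Counter` is only a stand-in for the `unique()` call on the data frame.

```python
from collections import Counter

def domain_name(url):
    # same logic as the notebook helper: drop "www." and the scheme,
    # then keep the first dot-separated label of the host
    return url.split("www.")[-1].split("//")[-1].split(".")[0]

# URL prefixes from the samples above; "x" is a placeholder path
urls = [
    "https://www.igotravel.co.za/holidays/x",
    "https://packages.travelstart.co.za/x",
    "https://computravel.co.za/packages/x",
]

domains = [domain_name(u) for u in urls]
print(domains)
print(Counter(domains))
```

Note that the travelstart URL ends up under the label `packages`, because the helper keeps only the first label of the host name. That is why `packages` is treated as a domain in the cells that follow.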

In [21]:
#igotravelagency
neew = selected_rows.loc[selected_rows['domain']=='igotravel']

In [22]:
neew.sample(2)

Unnamed: 0,scraped_url,agency_slug,country,title,description,inclusions,hotel_name,hotel_mapped,domain
226,https://www.igotravel.co.za/holidays/sea-cliff...,igotravel,tanzania,Sea Cliff Resort Spa Zanzibar,"<div class=""woocommerce-product-details__short...","<div class=""blck col-sm-6"">\n<h3 class=""blck-t...",,,igotravel
204,https://www.igotravel.co.za/holidays/cinnamon-...,igotravel,maldives,Cinnamon Hakuraa Huraa,"<div class=""woocommerce-product-details__short...","<div class=""blck col-sm-6"">\n<h3 class=""blck-t...",,,igotravel


The igotravel and packages agencies use the hotel name as the title.

In [24]:
neew['hotels_new'] = neew['title']

In [25]:
neew1 = selected_rows.loc[selected_rows['domain']=='packages']

In [26]:
neew1['hotels_new'] = neew1['title']

In [27]:
neew2 = selected_rows.loc[selected_rows['domain']=='computravel']

In [33]:
neew2.sample(2)

Unnamed: 0,scraped_url,agency_slug,country,title,description,inclusions,hotel_name,hotel_mapped,domain
5044,https://computravel.co.za/packages/11102/mauri...,computravel,mauritius,Mauritius - 4* plus Shandrani Beachcomber 25% ...,"<div class=""item-description-text"">\n\n<h6>Qui...",,,,computravel
6400,https://computravel.co.za/packages/11131/mauri...,computravel,mauritius,Mauritius - 5* Long Beach Resort - Pay 5 Stay ...,"<div class=""item-description-text"">\n\n<h6>Qui...",,,,computravel


The domain name **computravel** has a pattern in the title.

In [28]:

def findAfter(txt):
    # star-rating keywords that precede the hotel name in computravel titles
    keywords = ('4*', '3*', '2*', '5*', '5star')
    for x in txt.split(" "):
        if x in keywords:
            start = txt.find(x)     # start of the first occurrence of the keyword
            end = start + len(x)    # end of the keyword
            return txt[end:end+30]  # the 30 characters that follow the keyword
    return None  # no keyword found in the title
    

In [29]:
neew2['value'] = neew2['title'] .astype(str).apply(findAfter)

In [30]:
# keep the text before the first hyphen as the hotel name
neew2['hotels_new'] = neew2['value'].str.extract('(.+?)-', expand=False)
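To see what these two steps produce, here is a minimal standalone sketch. The helper name `text_after_rating` and the full title are assumptions made for illustration: the title is modelled on the truncated computravel samples above, not copied from the data. Unlike the notebook cells, the sketch also strips the surrounding spaces.

```python
import re

def text_after_rating(title, keywords=("2*", "3*", "4*", "5*", "5star"), width=30):
    """Return up to `width` characters following the first star-rating token."""
    for token in title.split(" "):
        if token in keywords:
            end = title.find(token) + len(token)
            return title[end:end + width]
    return None

# Hypothetical title, modelled on the truncated samples above
title = "Mauritius - 5* Long Beach Resort - Pay 5 Stay 7"

value = text_after_rating(title)      # text that follows "5*"
match = re.search(r"(.+?)-", value)   # lazily capture up to the first hyphen
hotel = match.group(1).strip() if match else None

print(repr(value))
print(hotel)
```

The 30-character window is a heuristic: a hotel name longer than the window, or one that itself contains a hyphen, would be cut short.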

**Pentravel** has the title as the hotel name

In [31]:
neew3 = selected_rows.loc[selected_rows['domain']=='pentravel']

In [32]:
neew3['hotels_new'] = neew3['title']

The **Flightcentre** agency has the names of the hotels in the description column.

In [33]:
neew4 = selected_rows.loc[selected_rows['domain']=='flightcentre']

In [34]:
def findAfter2(txt):
    # star-rating keywords that precede the hotel name in flightcentre descriptions
    keywords = ('4-star', '3star', '4star', '3-star', '45star',
                '5star', '4,5-star', '5-star')
    for x in txt.split(" "):
        if x in keywords:
            start = txt.find(x)     # start of the first occurrence of the keyword
            end = start + len(x)    # end of the keyword
            return txt[end:end+30]  # the 30 characters that follow the keyword
    return None  # no keyword found in the description

In [35]:
neew4['hotels_new'] = neew4['description'] .astype(str).apply(findAfter2)

In [36]:
## function for selecting the first three words to be the name
def first3(txx):
    lis=txx.split()[:3]
    new=' '.join(lis)
    return new

In [37]:
neew4['hotels_new']=neew4['hotels_new'] .astype(str).apply(first3)

We are now going to concatenate all the dfs that have been created

In [38]:
frames = [neew, neew1,neew2,neew4,neew3]
result = pd.concat(frames)

In [39]:
frames3 = [nonselected_rows,result]
result3 = pd.concat(frames3)

For the rest of the agencies, the hotel name is also the title.


In [40]:
arr = [
 'casterbridge-hollow',
 'elephant-plains-lodge',
 'idube-game-reserve',
 'marataba-trails-lodge',
 'hans-merensky-hotel-and-golf-estate',
 'bayethe-lodge',
 '4-day-fly-in-safari',
 'madikwe-hills-private-game-lodge',
 'phinda-homestead',
 'long-lee-manor',
 'camp-shawu',
 'thula-thula-private-game-reserve',
 'simbavati-river-lodge',
 'inn-on-the-square',
 'madikwe-safari-lodge',
 'country-boutique-hotel',
 'rhino-ridge-safari-lodge',
 'entabeni-lakeside-lodge',
 'kirkmans-kamp',
 'exeter-river-lodge',
 'ekuthuleni-lodge',
 'nedile-lodge',
 'leopard-hills-private-game-reserve',
 'thonga-beach-lodge',
 'nottens-bush-camp',
 'isibindi-zulu-lodge',
 'hillsnek-safari-camp',
 'camp-shonga',
 'leadwood-lodge',
 'nungubane-game-lodge',
 'impodimo-game-lodge',
 'alpine-heath-resort',
 'entabeni-kingfisher-lodge',
 'mandela-rhodes-place',
 'marataba-safari-lodge',
 'greenway-woods-resort-2980',
 'legend-golf-and-safari-resort',
 '2-day-fly-in-safari',
 'umkumbe-safari-lodge',
 'hollow-on-the-square-cape-town-city-hotel',
 'arathusa-safari-lodge',
 'perry-039-s-bridge-hollow-boutique-hotel',
 'sabi-sabi-bush-lodge',
 'hippo-hollow-country-estate',
 'oceana-beach-and-wildlife-resort',
 '4-day-camping-trip-to-kruger-national-park',
 'ecolux-boutique-hotel-and-spa',
 'bush-lodge',
 'londolozi-private-game-reserve',
 'nkomazi-game-reserve',
 '3-day-fly-in-safari',
 'little-bush-camp',
 'rhino-walking-safari',
 'kosi-forest-lodge',
 'white-elephant-safari-lodge',
 'jock-explorer-camp',
 'baradinckals-bush-lodge',
 'thandeka-game-lodge',
 'dulini-lodge',
 'leolapa',
 'mkuze-falls-private-game-reserve',
 'phinda-forest-lodge',
 'entabeni-wildside-safari-lodge',
 'entabeni-ravineside-lodge',
 'hanglip-mountain-lodge',
 'bushbreaks']

In [41]:
#Select the rows whose domain is in the array and use the title as the hotel name
neew6=selected_rows.loc[selected_rows['domain'].isin(arr)]
neew6['hotels_new'] = neew6['title']

**Final result**

In [42]:
frames4 = [neew6,result3]
result4 = pd.concat(frames4)

In [43]:
display(result4.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8379 entries, 126 to 6818
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   scraped_url   8379 non-null   object 
 1   agency_slug   8379 non-null   object 
 2   country       7992 non-null   object 
 3   title         8379 non-null   object 
 4   description   5270 non-null   object 
 5   inclusions    6884 non-null   object 
 6   hotel_name    4524 non-null   object 
 7   hotel_mapped  0 non-null      float64
 8   domain        3855 non-null   object 
 9   hotels_new    8073 non-null   object 
 10  value         880 non-null    object 
dtypes: float64(1), object(10)
memory usage: 785.5+ KB


None

We can see that the hotels_new column now has only 306 null values (8379 − 8073), compared with the 3855 null values in hotel_name for the same rows, so 3549 names have been filled in.

In [44]:
res=result4.sort_index()

In [45]:
res.to_csv('hotels10.csv',index=False)

###  Methodology (Matching the Names)

Now that I have the hotels10 table, whose **hotels_new** column contains the names I extracted, it's time to match them with the names in the hotels table.

In [46]:
hotels4 = pd.read_csv('hotels10.csv')

In [22]:
hotels4['hotels_new'].sample(10)

7807                      swartberg country manor
6384                   Heritage Le Telfair Resort
3201                                     sun city
5546                   5 Star Riu Palace Zanzibar
2123                diamonds mapenzi beach resort
5131    3 star Berjaya Beau Vallon Bay Seychelles
4706                       Hamiltons Tented Camp 
5761                          Club Med Seychelles
2208          gold zanzibar beach house &amp; spa
7841                                  the cellars
Name: hotels_new, dtype: object

I used cosine similarity on TF-IDF vectors because:


    It is fast: the main computation is a matrix multiplication, which SciPy and NumPy perform quickly, and the tokenization and vectorization steps can easily be parallelized.
    It is accurate: n-gram tokenization makes the match independent of word order and tolerant of small spelling differences.

I had tried the popular **fuzzywuzzy** module, which performs fuzzy string matching: it compares a pattern with a sequence of strings in the database and returns a similarity score as a percentage, rather than a plain match or no-match answer.

There are many ways to perform fuzzy string matching, for instance Levenshtein distance, but they have a performance problem: each record is compared against all the other records in the data, so the running time grows quadratically with the number of records.
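As a rough illustration of why this gets expensive, here is a minimal pure-Python sketch of Levenshtein distance together with the all-pairs loop it forces. The function and the short name lists are illustrative assumptions, not the fuzzywuzzy implementation or the real data.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Illustrative names only
scraped_names = ["radisson blu azuri", "zilwa attitude"]
reference_names = ["Radisson Blu Hotel", "Zilwa Attitude", "Wild Coast Sun"]

comparisons = 0
best = {}
for name in scraped_names:
    scored = []
    for ref in reference_names:
        comparisons += 1  # every scraped name meets every reference name
        scored.append((levenshtein(name, ref.lower()), ref))
    best[name] = min(scored)[1]

print(comparisons)  # len(scraped_names) * len(reference_names)
print(best)
```

With thousands of scraped names and a much larger reference table, the number of comparisons is the product of the two sizes, and each comparison is itself a nested loop over the characters.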

The TF-IDF vectors are built from n-grams, i.e. groups of consecutive letters, so that a misspelling or typo only changes a few of the features. For instance, the word "independence" is chunked into overlapping groups of letters whose length depends on n.
Our goal here is not just to match the strings but to match them quickly, which is where n-grams and TF-IDF with cosine similarity come into play.

**N-grams**

N-grams are extensively used in text mining and natural language processing. An n-gram is a sequence of n co-occurring items (words or, as in this project, characters) within a given text. To find the n-grams, we slide a window of size n over the text, moving one item forward at a time (the step can be changed as needed).
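As a small illustration, the sketch below chunks the word "independence" into character n-grams for a few values of n. The helper name `char_ngrams` is an assumption made for this example; it leaves out the punctuation stripping that the notebook's own function performs.

```python
def char_ngrams(text, n):
    """All overlapping character n-grams of text, moving one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

word = "independence"
for n in (2, 3, 4):
    print(n, char_ngrams(word, n))
```

A 12-letter word yields 12 − n + 1 n-grams, so a single typo disturbs at most n of them and leaves the rest available for matching.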

**TF-IDF**

TF-IDF stands for term frequency-inverse document frequency. The TF-IDF weight is often used in information retrieval and text mining, and has two main properties:

    1. It is used to evaluate how important a word is to a document in a collection

    2. The importance increases proportionally to the number of times a word appears in the document, but is offset by how often the word appears across the collection
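The two points above can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the notebook's `TfidfVectorizer` call: it uses character trigrams as terms, the smoothed idf `ln((1 + n) / (1 + df)) + 1` and L2 normalisation that scikit-learn documents as its defaults, and three made-up names.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_vectors(names, n=3):
    """One {ngram: weight} dict per name, L2-normalised."""
    docs = [Counter(char_ngrams(name.lower(), n)) for name in names]
    # document frequency: in how many names each n-gram occurs
    df = Counter(gram for doc in docs for gram in doc)
    total = len(docs)
    vectors = []
    for doc in docs:
        weights = {gram: tf * (math.log((1 + total) / (1 + df[gram])) + 1)
                   for gram, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({gram: w / norm for gram, w in weights.items()})
    return vectors

names = ["sun city hotel", "sun city resort", "long beach hotel"]  # illustrative
vectors = tfidf_vectors(names)

print(vectors[0]["sun"])  # trigram shared with another name
print(vectors[0]["y h"])  # trigram unique to the first name, so it weighs more
```

Rare n-grams get the larger weights, which is what lets distinctive parts of a hotel name dominate over common fragments such as "hotel" or "resort".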
    
**Cosine Similarity**

Text similarity describes how close two text documents are to each other, either in surface form (lexical similarity) or in meaning (semantic similarity). There are various methods for calculating it, such as cosine similarity, Euclidean distance, the Jaccard coefficient, and the Dice coefficient.

Cosine similarity measures the similarity between two documents irrespective of their size. Mathematically, it is the cosine of the angle between two n-dimensional vectors. Because TF-IDF weights are never negative, the value here ranges from 0 to 1,

where,

    1 means more similarity
    0 means less similarity.
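A minimal sketch of the calculation, assuming sparse vectors stored as dictionaries. For brevity it uses raw trigram counts rather than TF-IDF weights, and the three names are chosen only for illustration.

```python
import math
from collections import Counter

def trigram_counts(text):
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

a = trigram_counts("Radisson Blu Azuri")
b = trigram_counts("Radisson Blu Hotel")
c = trigram_counts("Wild Coast Sun")

print(cosine_similarity(a, b))  # many shared trigrams
print(cosine_similarity(a, c))  # no shared trigrams
```

Because only the angle matters, a short name and a longer variant of it can still score highly as long as they share most of their n-grams.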
    
**Sparse_dot_topn**

The data scientists in the ING Wholesale Banking Advanced Analytics team found that computing cosine similarity with the general-purpose scikit-learn function has some disadvantages:

    - The sklearn version does a lot of type checking and error handling, since it is a general-purpose function. In our case, it is guaranteed that the multiplication will be done on two sparse matrices with proper sizes and formats.
    - The sklearn version calculates and stores all similarities in one go, while we are only interested in the most similar ones. So we do not need to calculate a matrix of size M × N but one of size M × n, where n is much smaller than N. This reduces the memory usage significantly.
    - The similarity scores (i.e. the result matrix entries) below a certain threshold can be ignored easily, so that they do not take up space in memory and are not involved in the partial sorting of the candidate list.

    Computing everything therefore costs more memory and more time than necessary.

To address these disadvantages, they created their own library (sparse_dot_topn), which stores only the top N highest matches in each row, and only the similarities above the threshold. They report that this approach improves the speed by about 40% and reduces memory consumption.




#### ngrams 


In [47]:
#  ngrams: here we take n = 3, i.e. 3-grams (trigrams).
#  The function first removes some punctuation (, - . /) and the token ' BD' from the string,
#  then generates and collects all character n-grams of the string.

 
def ngrams(string, n=3):

    string = re.sub(r'[,-./]|\sBD',r'', string)
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]





# Testing ngrams work for verification 

print('All 3-grams in "Deluxroom":')
ngrams('Deluxroom')

All 3-grams in "Deluxroom":


['Del', 'elu', 'lux', 'uxr', 'xro', 'roo', 'oom']

#### TF-IDF and Vectorization


In [48]:
hotels5 = hotels4.drop(['agency_slug','country','title','description','inclusions','hotel_name','hotel_mapped','domain','value','scraped_url'], axis=1)
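# copy so that the assignment below works on an independent frame
hotels6 = hotels5.dropna().copy()
# strip any bracketed text; regex=True is needed because recent pandas
# versions treat the pattern as a literal string by default
hotels6['hotels_new'] = hotels6['hotels_new'].str.replace(r"\(.*\)", "", regex=True)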


**AWESOME COSSIM TOP**

First, both A and B are converted to CSR (compressed sparse row) format. If A and B are already in CSR format, the conversion adds no overhead. Then the number of rows of A and the number of columns of B are retrieved. Next, memory for the output matrix C is reserved (the indptr, indices and data arrays in the function below); the maximum space needed is M*ntop elements. Before calling the C++ code, boundary conditions should also be checked so that all-zero matrices are not passed to the function; the function below does not include such a check.

**ct**

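`ct.sparse_dot_topn` is the function imported from the sparse_dot_topn package that performs the multiplication and the top-n selection.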

**What Happens Here**

Because CSR allows fast row access and matrix multiplication, it is the format used by the SciPy sparse matrix dot function.
 
 ![1_bNMlGiPmcw4aqO2wFQX4OA.png](attachment:1_bNMlGiPmcw4aqO2wFQX4OA.png)
 
 
 
**We implement the sparse matrix multiplication and top-n selection with the following arguments:**

M: number of rows of matrix A, in our case the number of names to match

N: number of columns of matrix B, in our case the number of ground-truth names

np.asarray(A.indptr, dtype=idx_dtype), np.asarray(A.indices, dtype=idx_dtype), A.data: pointer, index and data arrays of A

np.asarray(B.indptr, dtype=idx_dtype), np.asarray(B.indices, dtype=idx_dtype), B.data: pointer, index and data arrays of B

ntop: the number of top cosine similarity scores to keep per row

lower_bound: if the value of an element of C is less than lower_bound, it is replaced with zero

indptr, indices, data: pointer, index and data arrays of C, the output matrix

**Some local variables are initialised as follows:**

sums: a sparse vector that records the multiplication result of the current row. It is initialised as an all-zero vector.

next: a sparse vector that keeps a linked list of the current row. Every element points to the next column index.

candidates: a list that stores all non-zero multiplication results in the current row. The top-n results will be selected from candidates.

nnz: the number of non-zero elements in the current row.

Cp: the row index pointer. It starts at 0.

Then the rows of matrix A are iterated over, and three main tasks are performed for every row:

    1. The multiplication of row i of matrix A with matrix B is computed and accumulated in sums.
    2. The sums vector is revisited to pre-select the candidates.
    3. The top-n results are selected from the candidates.
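The three per-row tasks can be sketched in pure Python as follows. This is only an illustration of the idea, not the library's C++ code: the function name `dot_topn_sketch` is hypothetical, vectors are stored as dictionaries instead of CSR arrays, and a heap stands in for the partial sort.

```python
import heapq

def dot_topn_sketch(a_rows, b_cols, ntop, lower_bound=0.0):
    """a_rows: one {feature: weight} dict per name to match.
    b_cols: one {feature: weight} dict per reference name.
    Returns, per row, up to ntop (column, score) pairs above lower_bound."""
    # inverted index: feature -> [(column index, weight)]
    index = {}
    for j, col in enumerate(b_cols):
        for feature, weight in col.items():
            index.setdefault(feature, []).append((j, weight))

    result = []
    for row in a_rows:
        # task 1: multiply the row with B, accumulating per column in sums
        sums = {}
        for feature, a in row.items():
            for j, b in index.get(feature, ()):
                sums[j] = sums.get(j, 0.0) + a * b
        # task 2: pre-select the candidates above the threshold
        candidates = [(score, j) for j, score in sums.items() if score > lower_bound]
        # task 3: keep only the ntop best candidates
        top = heapq.nlargest(ntop, candidates)
        result.append([(j, score) for score, j in top])
    return result

a_rows = [{"a": 1.0, "b": 1.0}]
b_cols = [{"a": 0.5}, {"a": 1.0, "b": 1.0}, {"c": 1.0}]
print(dot_topn_sketch(a_rows, b_cols, ntop=1))
```

Only the columns that share at least one feature with the row are ever touched, and at most ntop results per row are stored, which is where the memory saving comes from.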
 



In [85]:
# To calculate the similarity between two vectors of TF-IDF values, cosine similarity is usually used.
# The result matrix is very sparse, so it is returned as a sparse CSR matrix.

def awesome_cossim_top(A, B, ntop, lower_bound=0):
    """Multiply A by B and keep only the top-n results in each row.

    First, both A and B are converted to CSR format (fast access and
    matrix multiplication). If they are already in CSR format, there is
    no overhead in converting. Then the number of rows of A and the
    number of columns of B are retrieved. Next, memory for the output
    matrix is reserved. The maximum space is M*ntop elements."""
    # force A and B to be CSR matrices
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape
 
    idx_dtype = np.int32
 
    nnz_max = M*ntop
 
    indptr = np.zeros(M+1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)

    ct.sparse_dot_topn(
        M, N, np.asarray(A.indptr, dtype=idx_dtype),
        np.asarray(A.indices, dtype=idx_dtype),
        A.data,
        np.asarray(B.indptr, dtype=idx_dtype),
        np.asarray(B.indices, dtype=idx_dtype),
        B.data,
        ntop,
        lower_bound,
        indptr, indices, data)

    return csr_matrix((data,indices,indptr),shape=(M,N))


The following code unpacks the resulting sparse matrix. An option to look at only the first n values is added.

In [86]:
def get_matches_df(sparse_matrix, A, B, top=100):
    non_zeros = sparse_matrix.nonzero()

    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    # never read past the number of non-zero entries actually present
    if top:
        nr_matches = min(top, sparsecols.size)
    else:
        nr_matches = sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similairity = np.zeros(nr_matches)

    for index in range(0, nr_matches):
        left_side[index] = A[sparserows[index]]
        right_side[index] = B[sparsecols[index]]
        similairity[index] = sparse_matrix.data[index]

    return pd.DataFrame({'left_side': left_side,
                         'right_side': right_side,
                         'similairity': similairity})

Calling the functions:


In [87]:
df_dirty = hotels6

df_clean = hotels

# print (df_dirty["name"])
# print (df_clean["name"])

vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix_clean = vectorizer.fit_transform(df_clean['Name'].unique())
tf_idf_matrix_dirty = vectorizer.transform(df_dirty['hotels_new'].unique())

t1 = time.time()
matches = awesome_cossim_top(tf_idf_matrix_dirty, tf_idf_matrix_clean.transpose(), 1, 0)
t = time.time()-t1
print("SELFTIMED:", t)
matches_df = get_matches_df(matches, df_dirty['hotels_new'].unique(), df_clean['Name'].unique(), top=1305)
matches_df = matches_df[matches_df['similairity'] < 0.99] # For removing all exact matches
matches_df = matches_df[matches_df['similairity'] > 0.5] # For getting all matches with a similarity more than 0.5


with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(matches_df)

SELFTIMED: 2.26708722114563
                                              left_side  \
0                            Adaaran Select Meedhupparu   
3                                    Radisson Blu Azuri   
4                                       Protea Hotel by   
5                                 Ibis Budapest Centrum   
7                                  Bluebay Beach Resort   
9                                                  None   
10                                 Tropical Attitude in   
11                                       hotel in Paris   
13                                   Radisson Blu Poste   
14                                  Diamonds Mapenzi in   
17                               Adaaran Select Huduran   
19                                    Sani Valley Lodge   
20                                     Ibis Styles Nice   
23                             Anantara Bazaruto Island   
24                                    Boutique Hotel in   
26                          

In [88]:
matches=matches_df.sort_values(['similairity'], ascending=False)

In [89]:
# create a dictionary with key as the correct hotel name and value as the names extracted
gk = matches_df.groupby('right_side')
gk1= matches_df.groupby('right_side')['left_side'].apply(list).reset_index(name='new')
dicti = {k: g["left_side"].tolist() for k,g in matches_df.groupby("right_side")}
dicti

Now we can create a function to replace the extracted values with the correct names.

In [90]:
def replace_values(series, my_dict):
    # invert {correct name: [extracted names]} into {extracted name: correct name}
    lookup = {p: k for k, v in my_dict.items() for p in v}
    # swap each extracted name for its correct name; other values stay unchanged
    return pd.DataFrame(series.map(lambda x: lookup.get(x, x)))

In [99]:
res['replaced']=replace_values(res.iloc[0:,0:]['hotels_new'],dicti)


**We finally have a df with the replaced hotel names**

In [100]:
res
finalresult = res.drop(['agency_slug','hotel_name','hotel_mapped','value'], axis=1)
finalresult

Unnamed: 0,scraped_url,country,title,description,inclusions,domain,hotels_new,replaced
0,https://www.flightcentre.co.za/product/14197346,maldives,Adaaran Select Meedhupparu,\nYour Maldives Holiday Package includes:\nRet...,\nYour Maldives Holiday Package includes:\n\nR...,flightcentre,Adaaran Select Meedhupparu,Adaaran Select Meedhupparu All Inclusive
1,https://www.flightcentre.co.za/product/5415249,south africa,Family Fun on the Wild Coast,\nYour Eastern Cape Holiday Package includes:\...,\nYour Eastern Cape Holiday Package includes:\...,flightcentre,Wild Coast Sun,Wild Coast Sun
2,https://www.flightcentre.co.za/product/16385011,mauritius,Azuri Residences by Life in Blue,\nYour Mauritius Holiday Package includes:\nRe...,\nYour Mauritius Holiday Package includes:\n\n...,flightcentre,Azuri Residences by,Azuri Residences by
3,https://www.flightcentre.co.za/product/16323810,mauritius,Seas the Day in Mauritius,\nYour Mauritius Holiday Package includes:\nRe...,\nYour Mauritius Holiday Package includes:\n\n...,flightcentre,Radisson Blu Azuri,Radisson Blu Hotel
4,https://www.flightcentre.co.za/product/13250621,south africa,Kruger Park Splendour,\nYour Kruger National Park Holiday Package in...,\nYour Kruger National Park Holiday Package in...,flightcentre,Protea Hotel by,Protea Hotel Marine
...,...,...,...,...,...,...,...,...
8374,https://www.bushbreaks.co.za/listing/ekuthulen...,south africa,Ekuthuleni Lodge,\nTwin/King Size Beds\nHot beverage facility\n...,,,welgevonden game reserve,welgevonden game reserve
8375,https://www.thompsons.co.za/holiday-packages/w...,south africa,West Coast & Namaqualand Self-Drive Package (7...,,<p>The best place to see the acres of colourfu...,,west coast and namaqualand self,west coast and namaqualand self
8376,https://www.thompsons.co.za/holiday-packages/w...,south africa,"West Coast, Flowers & Cederberg Self-Drive Pac...",,<p>The best place to see the acres of colourfu...,,"west coast, flowers and cederberg self","west coast, flowers and cederberg self"
8377,https://packages.travelstart.co.za/Zilwa-Attit...,mauritius,Zilwa Attitude Hotel,"<p style=""font-size: 16px;"">Zilwa Attitude, si...","<p style=""font-size: 16px;"">Return flights Joh...",,zilwa attitude,Zilwa Attitude


The matches look pretty satisfying! By using n-grams for tokenization, TfidfVectorizer to turn each name into a vector of TF-IDF weights, and cosine similarity computed with sparse_dot_topn, we matched the strings quickly even on a large dataset.