# Text Similarity
This notebook computes the similarity between two texts. We use a CSV file with two independent features, `text1` and `text2`.
Since there is no dependent variable, we have two options: treat the problem as unsupervised and build clusters, or engineer a target feature and treat the problem as a supervised regression problem.
We will go with the supervised learning approach.

In [1]:
# Importing necessary libraries
import numpy as np
import pandas as pd

import spacy
nlp = spacy.load("en_core_web_lg") # We would require `en_core_web_lg`, the trained pipeline for the English language

In [2]:
# Reading the CSV file into 'df'
df = pd.read_csv("Precily_Text_Similarity.csv")
df.head()

,text1,text2
0,broadband challenges tv viewing the number of ...,gardener wins double in glasgow britain s jaso...
1,rap boss arrested over drug find rap mogul mar...,amnesty chief laments war failure the lack of ...
2,player burn-out worries robinson england coach...,hanks greeted at wintry premiere hollywood sta...
3,hearts of oak 3-2 cotonsport hearts of oak set...,redford s vision of sundance despite sporting ...
4,sir paul rocks super bowl crowds sir paul mcca...,mauresmo opens with victory in la amelie maure...


In [3]:
df.shape

(3000, 2)

In [4]:
cols = df.columns
cols

Index(['text1', 'text2'], dtype='object')

In [5]:
type(cols[0])

str

## Finding Similarity
- The cells below show how the similarity between two texts is computed with the `en_core_web_lg` pipeline

In [6]:
s1 = nlp(df['text1'][0])
s2 = nlp(df['text2'][0])
print(s1)

broadband challenges tv viewing the number of europeans with broadband has exploded over the past 12 months  with the web eating into tv viewing habits  research suggests.  just over 54 million people are hooked up to the net via broadband  up from 34 million a year ago  according to market analysts nielsen/netratings. the total number of people online in europe has broken the 100 million mark. the popularity of the net has meant that many are turning away from tv  say analysts jupiter research. it found that a quarter of web users said they spent less time watching tv in favour of the net  the report by nielsen/netratings found that the number of people with fast internet access had risen by 60% over the past year.  the biggest jump was in italy  where it rose by 120%. britain was close behind  with broadband users almost doubling in a year. the growth has been fuelled by lower prices and a wider choice of always-on  fast-net subscription plans.  twelve months ago high speed internet 

In [7]:
# Finding the similarity
s1.similarity(s2)

0.9056197797638766
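By default, spaCy's `Doc.similarity` compares the two documents' vectors with cosine similarity, where a document's vector is the average of its token vectors. The sketch below reproduces that idea in plain Python. The 3-dimensional "word vectors" are made-up toy values, not real `en_core_web_lg` vectors, and the helper names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy token vectors for two short "documents" (illustrative values only).
doc_a = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
doc_b = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

# Document vector = mean of token vectors; score = cosine of the two means.
score = cosine_similarity(mean_vector(doc_a), mean_vector(doc_b))
print(score)
```

Because each document is reduced to one averaged vector, long texts on unrelated topics can still score fairly high, which may be part of why the first pair above comes out at about 0.91.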

In [8]:
# Checking if we have any missing values in our dataset
features_with_na = [feature for feature in df.columns if df[feature].isnull().sum()>0]
features_with_na

[]

## Feature Engineering
- We add a new column, `similarity score`, holding the spaCy similarity of each `text1`/`text2` pair; it will serve as the target variable

In [10]:
# Computing the similarity score for every row and storing it in the new column
# nlp.pipe processes the texts as a stream, and no row count is hard-coded
docs1 = nlp.pipe(df['text1'])
docs2 = nlp.pipe(df['text2'])
df['similarity score'] = [d1.similarity(d2) for d1, d2 in zip(docs1, docs2)]

## Training ML Models
- Since we are treating this as a supervised learning problem, we will try the `SVM` and `XGBoost` regressors and keep whichever performs better
- We first split the dataset into training and test sets
- The `similarity score` column is the dependent variable
- Since SVM and XGBoost require numerical input, we concatenate `text1` and `text2` for each row and convert the combined text into a TF-IDF vector

In [12]:
# Importing dependencies
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVR
# from sklearn.ensemble import RandomForestRegressor

# Train-test split
traindf, testdf = train_test_split(df, test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer()

# Note: `+` joins text1 and text2 with no separator between them
X_train = vectorizer.fit_transform(traindf['text1'] + traindf['text2'])
y_train = traindf['similarity score']
X_test = vectorizer.transform(testdf['text1'] + testdf['text2'])
y_test = testdf['similarity score']

# Training the SVM model
svm_model = SVR(kernel='linear')
svm_model.fit(X_train, y_train)
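One detail in the cell above is worth checking: `traindf['text1'] + traindf['text2']` concatenates the two strings directly, so the last word of `text1` and the first word of `text2` end up as a single token. The sketch below illustrates the effect with the regular expression that `TfidfVectorizer` uses as its default `token_pattern`. The two example strings are invented, and `tokens` is a hypothetical helper, not part of scikit-learn.

```python
import re

# Default token pattern of scikit-learn's TfidfVectorizer:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokens(text):
    """Lowercase the text and split it the way the default vectorizer would."""
    return TOKEN_PATTERN.findall(text.lower())


text1 = "broadband users doubled"
text2 = "gardener wins double"

fused = tokens(text1 + text2)         # no separator, as in the cell above
spaced = tokens(text1 + " " + text2)  # explicit space between the two texts

print(fused)
print(spaced)
```

Joining with `" "` keeps the boundary words as separate features. Applying that change to the notebook would alter the TF-IDF features, so the metrics reported below would shift somewhat.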

In [14]:
# Predicting the similarity scores using the SVM model
svm_predictions = svm_model.predict(X_test)

In [15]:
svm_predictions

array([0.87404816, 0.89618269, 0.85154542, 0.88178689, 0.87023372,
       0.89053894, 0.88395429, 0.89605138, 0.82466519, 0.88578716,
       0.88920097, 0.89059203, 0.90165632, 0.88667827, 0.87749108,
       0.84843911, 0.91960737, 0.87867569, 0.88421354, 0.89292077,
       0.88677809, 0.85445993, 0.89835021, 0.8931667 , 0.84933968,
       0.8886955 , 0.88879651, 0.87085866, 0.88537366, 0.82877181,
       0.8947034 , 0.87282611, 0.85325428, 0.85381077, 0.88088246,
       0.85219758, 0.86331246, 0.81701535, 0.8547805 , 0.85694684,
       0.87431252, 0.8620938 , 0.89280346, 0.85637346, 0.88726283,
       0.86086683, 0.87142117, 0.85649541, 0.86412691, 0.86678969,
       0.89145622, 0.85384307, 0.88013741, 0.87573374, 0.84540528,
       0.89421933, 0.86762312, 0.93074408, 0.8591994 , 0.85452262,
       0.87142365, 0.86138221, 0.82645866, 0.89940393, 0.87511878,
       0.89528946, 0.82786607, 0.86710292, 0.86444457, 0.88306881,
       0.89313704, 0.89583092, 0.90378615, 0.8584821 , 0.82685

In [17]:
# Evaluating the SVM Model using basic model evaluation metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

mse = mean_squared_error(y_test, svm_predictions)
mae = mean_absolute_error(y_test, svm_predictions)
r_squared = r2_score(y_test, svm_predictions)

print("Mean Squared Error : {}".format(mse))
print("Mean Absolute Error : {}".format(mae))
print("R Squared : {}".format(r_squared))

Mean Squared Error : 0.00371981079113558
Mean Absolute Error : 0.051563040367998046
R Squared : 0.19225607489514007
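To make the R Squared value easier to interpret: it is defined as 1 - SS_res / SS_tot, the share of the target's variance that the predictions account for, so a model that always predicts the mean of the targets scores 0. The sketch below computes it by hand on invented numbers. `r2_manual` is a hypothetical helper, and the values are not taken from this dataset.

```python
def r2_manual(y_true, y_pred):
    """R squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative targets and predictions (made up for this example).
y_true = [0.80, 0.85, 0.90, 0.95]
y_pred = [0.82, 0.86, 0.89, 0.93]

# A baseline that ignores the input and always predicts the mean target.
mean_target = sum(y_true) / len(y_true)
baseline = [mean_target] * len(y_true)

print(r2_manual(y_true, y_pred))
print(r2_manual(y_true, baseline))
```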


#### An R² of 0.19 means the SVM explains only about 19% of the variance in the similarity scores, so it is a poor fit here. Let's try XGBoost next

In [20]:
# Importing xgboost
import xgboost as xgb

# Training and predicting our model
xgbmodel = xgb.XGBRegressor()
xgbmodel.fit(X_train, y_train)
xgb_pred = xgbmodel.predict(X_test)

In [21]:
# Model Evaluation
mse = mean_squared_error(y_test, xgb_pred)
mae = mean_absolute_error(y_test, xgb_pred)
r2 = r2_score(y_test, xgb_pred)

print("Mean Squared Error : {}".format(mse))
print("Mean Absolute Error : {}".format(mae))
print("R Squared : {}".format(r2))

Mean Squared Error : 0.0018006245234264847
Mean Absolute Error : 0.027701822648441333
R Squared : 0.6090006718463856
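Taking the square root of the two reported MSE values puts the errors back on the scale of the similarity score itself, which makes the two models easier to compare. The sketch below only reuses the MSE numbers printed in this notebook; the dictionary names are hypothetical.

```python
import math

# MSE values copied from the two evaluation cells in this notebook.
reported_mse = {
    "SVM": 0.00371981079113558,
    "XGBoost": 0.0018006245234264847,
}

# RMSE is the square root of MSE, expressed in the same units as the target.
rmse = {name: math.sqrt(value) for name, value in reported_mse.items()}

for name, value in rmse.items():
    print(f"{name} RMSE: {value:.4f}")
```

This gives a typical error of about 0.06 for the SVM versus about 0.04 for XGBoost on a score that mostly lies between 0.8 and 0.95.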


#### An R² of 0.61 (much higher than 0.19) suggests that XGBoost is the better option here

In [31]:
# Saving the trained model in the same directory for future use
xgbmodel.save_model('xgbmodel.model')

### What Next?
- Now that we have saved our model, we can move on to building our Flask app for a development environment.
- We will have an `app.py` Flask application and an `index.html` template to be rendered
- This notebook is not needed to run the app, but it can serve as technical documentation alongside the short one-page report that will be provided.
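Since the model was trained on TF-IDF features, the app will also need the fitted `TfidfVectorizer` to map new text into the same feature space; the notebook as written saves only the XGBoost model. A minimal sketch of one way to persist it, assuming `pickle` is acceptable for this purpose. The file name `tfidf_vectorizer.pkl` and the toy corpus are hypothetical stand-ins.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the combined text1/text2 training strings.
corpus = [
    "broadband challenges tv viewing",
    "gardener wins double in glasgow",
    "rap boss arrested over drug find",
]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(corpus)

# Persist the fitted vectorizer (hypothetical file name, temporary directory).
vec_path = Path(tempfile.mkdtemp()) / "tfidf_vectorizer.pkl"
with open(vec_path, "wb") as fh:
    pickle.dump(vectorizer, fh)

# Later, e.g. when the app starts: reload it and transform unseen text
# with the vocabulary learned at training time.
with open(vec_path, "rb") as fh:
    restored = pickle.load(fh)

new_matrix = restored.transform(["broadband wins in glasgow"])
print(new_matrix.shape)
```

The reloaded vectorizer produces rows with the same number of columns as the training matrix, which is what the saved regressor expects at prediction time. Pickle files should only be loaded from trusted sources.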