Text Similarity Score

Getting Started

The goal of this challenge is to build an application that computes a similarity score between two texts.

A Flask app demonstrates the service. The application takes two texts as input and uses a metric to determine how similar they are. Documents that are exactly the same should get a score of 1, and documents that have no words in common should get a score of 0.

Tasks performed

Task 1: Read input text in JSON format
Task 2: Preprocess the input texts (remove stopwords and punctuation)
Task 3: Generate N-grams for both texts
Task 4: Calculate similarity score
Task 5: Create Flask app
Task 6: Dockerize the endpoint
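
Tasks 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the code in app/text_similarity.py: the function names `preprocess` and `ngrams`, the tiny `STOPWORDS` set, the lowercasing step, and the default of n=2 are all assumptions made for the example. The project reads its stop words from app/stopwords.txt.

```python
import string

# Hypothetical, shortened stop-word list used only for this example.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "for", "you", "with"}


def preprocess(text):
    """Lowercase the text, strip ASCII punctuation, and drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [t for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n=2):
    """Return the n-grams (as tuples) of a token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = preprocess("The easiest way to earn points, with Fetch Rewards!")
print(tokens)
print(ngrams(tokens, 2))
```

A text with fewer than n tokens after preprocessing yields an empty n-gram list, which the scoring step then has to handle.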

NOTE: I made a number of decisions while developing this solution:

I removed all punctuation during preprocessing.
In the similarity comparison, all English words matter except stop words, which I removed in the preprocessing step.
The ordering of words matters. Instead of a word-for-word comparison, I compared phrases (n-grams) to take some of the context around each word into account.
I used the formula Intersection(A, B) / Union(A, B) to calculate the similarity score for two texts A and B. Intersection(A, B) is the number of n-grams common to both texts, whereas Union(A, B) is the number of unique n-grams across both texts.
I used the 'list' and 'set' data structures to calculate the similarity of the two texts.
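
The Intersection(A, B) / Union(A, B) formula above can be sketched with lists and sets as follows. The names `jaccard_score` and `bigrams` are hypothetical, and returning 0.0 when both inputs are empty is an assumption of this sketch; the source does not say how that case is handled.

```python
def jaccard_score(ngrams_a, ngrams_b):
    """Return |A intersect B| / |A union B| for two lists of n-grams."""
    set_a, set_b = set(ngrams_a), set(ngrams_b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty inputs score 0.0 (avoids division by zero).
        return 0.0
    return len(set_a & set_b) / len(union)


def bigrams(tokens):
    """Helper for the example: consecutive token pairs as a list."""
    return list(zip(tokens, tokens[1:]))


a = bigrams(["scan", "each", "grocery", "receipt"])
b = bigrams(["scan", "each", "receipt"])

print(jaccard_score(a, a))  # identical inputs: 1.0
print(jaccard_score(a, b))  # 1 shared bigram out of 4 unique: 0.25
```

Because the comparison is over n-grams rather than single words, dropping one word ("grocery") removes two bigrams from the intersection, so word order and adjacency affect the score.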

Flow chart

Project Structure

Text-Similarity/
├── app/
│   ├── app.py
│   ├── stopwords.txt
│   └── text_similarity.py
├── Dockerfile
├── README.md
└── requirements.txt

Build

To build the application, run the following commands:

Local:

git clone https://github.com/goyal07nidhi/Text-Similarity.git
cd Text-Similarity/
docker build -t nidhi2019/text-similarity:latest .

From Dockerhub:

docker pull nidhi2019/text-similarity

Run

docker run -it --rm -p 5000:5000 nidhi2019/text-similarity

Test

Postman:

Send a POST request to http://0.0.0.0:5000/score with the JSON input in the body.

Curl:

curl localhost:5000/score -d '{"Text_1": "<text>", "Text_2": "<text>"}' -H 'Content-Type: application/json'
e.g. curl localhost:5000/score -d '{"Text_1": "The easiest way to earn points with Fetch Rewards is to just shop for the products you already love. If you have any participating brands on your receipt, you'\''ll get points based on the cost of the products. You don'\''t need to clip any coupons or scan individual barcodes. Just scan each grocery receipt after you shop and we'\''ll find the savings for you.", "Text_2": "The easiest way to earn points with Fetch Rewards is to just shop for the items you already buy. If you have any eligible brands on your receipt, you will get points based on the total cost of the products. You do not need to cut out any coupons or scan individual UPCs. Just scan your receipt after you check out and we will find the savings for you."}' -H 'Content-Type: application/json'
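
Python:

The same request can be sent from Python using only the standard library. This is a sketch that mirrors the curl call above; the helper names `build_score_request` and `send` are hypothetical, and because the shape of the response is not documented here, the raw response body is returned as text.

```python
import json
import urllib.request


def build_score_request(text_1, text_2, url="http://localhost:5000/score"):
    """Build a POST request carrying the two texts as a JSON body."""
    payload = json.dumps({"Text_1": text_1, "Text_2": text_2}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


request = build_score_request("You will get points.", "You'll get points.")
# With the container running, call: print(send(request))
```

Using `json.dumps` avoids the shell quoting needed for apostrophes in the curl example.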
