# Task 5: Text Similarity & Search System

Hey, so this is Task 5. I need to build a simple text search tool that finds the sample texts most similar to a query. No downloading datasets, I just use my own sample texts. I'll use TF-IDF to turn each text into a vector of numbers, then cosine similarity to compare the vectors. Output goes to the console, no fancy GUI.

Requirements: 8+ samples (I did 10), vectorize, compare, show scores and best match. Notebook with markdown and code.

How it works: Texts to vectors, query to vector, find closest match. Easy peasy.

## Step 1: Import Stuff and Make Sample Texts
First, import sklearn for vectors and similarity. Then, write 10 sample texts myself. No internet stuff.

In [1]:
# Okay, importing libraries. Sklearn has everything we need.
from sklearn.feature_extraction.text import TfidfVectorizer  # This turns text to numbers
from sklearn.metrics.pairwise import cosine_similarity  # Compares how similar two things are
import numpy as np  # Needed later for np.argmax on the score array

# My own sample texts. I wrote these, no copy-paste from anywhere.
sample_texts = [
    "Artificial Intelligence is a field of computer science that creates smart machines.",
    "Machine Learning is a type of AI where computers learn from data without programming.",
    "Deep Learning uses neural networks to solve complex problems like image recognition.",
    "Natural Language Processing helps computers understand and generate human language.",
    "Robotics combines AI with mechanical engineering to build robots.",
    "Data Science involves analyzing data to find insights using statistics and ML.",
    "Python is a popular programming language for AI and data tasks.",
    "Big Data refers to large datasets that require special tools to process.",
    "Computer Vision allows machines to see and interpret images.",
    "Ethics in AI ensures technology is used fairly and safely."
]

print("Got 10 texts. Example:", sample_texts[0])  # Just checking

Got 10 texts. Example: Artificial Intelligence is a field of computer science that creates smart machines.


## Step 2: Turn Texts into Vectors
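Using TF-IDF. It scores a word higher the more often it appears in a text and lower the more texts it appears in, so each text becomes a row of numbers. Setting stop_words='english' drops boring words like 'the'.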
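To make the "words into numbers" part less magic, here is a rough sketch that rebuilds TF-IDF by hand and checks it against sklearn. It assumes sklearn's default settings (raw counts, smoothed IDF, L2 normalization). The mini corpus `docs` is made up just for this check and is not part of the task data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini corpus, only for illustration
docs = ["machines learn from data", "data tools process data", "robots are machines"]

# Term frequency: raw word counts per text (no stop-word removal, to keep it simple)
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # number of texts containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF, sklearn's default

weights = counts * idf
# Scale every row to length 1 (L2 normalization)
manual_tfidf = weights / np.linalg.norm(weights, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(docs).toarray()
print("Manual result matches sklearn:", np.allclose(manual_tfidf, sklearn_tfidf))
```

Because every row has length 1, the cosine similarity between two rows is just their dot product.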

In [2]:
# Make the vectorizer. Stop words to ignore common stuff.
vectorizer = TfidfVectorizer(stop_words='english')

# Fit it to texts and get vectors. Each text becomes a list of numbers.
text_vectors = vectorizer.fit_transform(sample_texts)

print("Vectors made. Shape is", text_vectors.shape)  # Rows are texts, columns are words
print("First vector snippet:", text_vectors[0].toarray()[0, :5])  # Peek at the first 5 numbers of row 0

Vectors made. Shape is (10, 66)
First vector snippet: [0.         0.         0.         0.37350983 0.        ]


## Step 3: Query and Similarity
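Pick a query, turn it into a vector with the same fitted vectorizer, and compare it to every text vector with cosine similarity. Cosine similarity measures the angle between two vectors: 1 means they point the same way (a perfect match), and 0 means the texts share no weighted words.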
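For reference, cosine similarity is the dot product divided by the product of the two vector lengths. A minimal sketch with made-up vectors (`a`, `b`, `c` and the helper `cosine` are just for illustration), checked against sklearn's `cosine_similarity`:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, just twice as long
c = np.array([0.0, 0.0, 3.0])  # shares no non-zero dimension with a

print("a vs b:", round(cosine(a, b), 4))  # same direction, so close to 1
print("a vs c:", round(cosine(a, c), 4))  # nothing in common, so 0
print("sklearn a vs b:", round(float(cosine_similarity([a], [b])[0, 0]), 4))
```

Length does not matter, only direction, which is why a short query can still match a longer text.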

In [3]:
# My query. Change this if you want.
query = "What is machine learning?"

# Vectorize query same way.
query_vector = vectorizer.transform([query])

# Get similarities. This gives scores for each text.
similarities = cosine_similarity(query_vector, text_vectors)[0]

print("Query:", query)
print("Scores:")
for i, score in enumerate(similarities):
    print(f"Text {i+1}: {score:.4f} - {sample_texts[i][:50]}...")  # Show score and bit of text

Query: What is machine learning?
Scores:
Text 1: 0.0000 - Artificial Intelligence is a field of computer sci...
Text 2: 0.5339 - Machine Learning is a type of AI where computers l...
Text 3: 0.1681 - Deep Learning uses neural networks to solve comple...
Text 4: 0.0000 - Natural Language Processing helps computers unders...
Text 5: 0.0000 - Robotics combines AI with mechanical engineering t...
Text 6: 0.0000 - Data Science involves analyzing data to find insig...
Text 7: 0.0000 - Python is a popular programming language for AI an...
Text 8: 0.0000 - Big Data refers to large datasets that require spe...
Text 9: 0.0000 - Computer Vision allows machines to see and interpr...
Text 10: 0.0000 - Ethics in AI ensures technology is used fairly and...
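Only two texts score above zero, even though Texts 1 and 9 mention "machines". A likely reason is that the vectorizer matches exact tokens and does no stemming, so "machine" and "machines" count as different words. Here is a quick sketch to check what tokens survive, using the analyzer from an unfitted vectorizer with the same settings (the variable names are only for this example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the tokenizing function (lowercasing, splitting,
# stop-word removal) and works without fitting
analyze = TfidfVectorizer(stop_words="english").build_analyzer()

query_tokens = analyze("What is machine learning?")
text9_tokens = analyze("Computer Vision allows machines to see and interpret images.")

print("Query tokens: ", query_tokens)
print("Text 9 tokens:", text9_tokens)
print("Shared tokens:", set(query_tokens) & set(text9_tokens))
```

"What" and "is" are dropped as stop words, so the query boils down to "machine" and "learning", and only texts containing exactly those tokens can score.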


## Step 4: Find the Best Match
Look for the highest score and show that text.

In [4]:
# Find the best one.
most_similar_index = np.argmax(similarities)
most_similar_score = similarities[most_similar_index]
most_similar_text = sample_texts[most_similar_index]

print("\nBest Match:")
print(f"Score: {most_similar_score:.4f}")
print(f"Text: {most_similar_text}")
print("So, this text is most like the query. Cool!")


Best Match:
Score: 0.5339
Text: Machine Learning is a type of AI where computers learn from data without programming.
So, this text is most like the query. Cool!
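One catch with `np.argmax`: if every score is 0 (the query shares no words with any text), it still returns index 0, so the first text would be reported as the "best match". Below is a possible extension, not part of the task requirements, that ranks the top few texts and skips zero scores. The function `top_matches` and its parameter `k` are hypothetical names I made up for this sketch, and it uses only three of the sample texts to stay short.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_matches(query, texts, k=3):
    """Return up to k (score, text) pairs with score > 0, best first."""
    vec = TfidfVectorizer(stop_words="english")
    text_vectors = vec.fit_transform(texts)
    scores = cosine_similarity(vec.transform([query]), text_vectors)[0]
    order = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return [(float(scores[i]), texts[i]) for i in order if scores[i] > 0]

texts = [
    "Machine Learning is a type of AI where computers learn from data without programming.",
    "Deep Learning uses neural networks to solve complex problems like image recognition.",
    "Ethics in AI ensures technology is used fairly and safely.",
]

for score, text in top_matches("What is machine learning?", texts):
    print(f"{score:.4f} - {text}")

# A query with no known words gives an empty list instead of a fake best match
print(top_matches("quantum cooking", texts))
```

The scores here will differ from the ones above because the vectorizer is fitted on three texts instead of ten.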
