# Machine Learning
### TF-IDF
**TF-IDF** is a **statistical** measure, often used as a **feature engineering** tool, that evaluates how important a word is to a document within a collection. 
- It calculates **term frequency** (how often a word appears in a document) multiplied by **inverse document frequency** (how rare the word is across all documents). 
- Common words like "the" get low scores, while rare, meaningful words get high scores. 

TF-IDF is widely used for text search, document similarity and clustering, text classification, and keyword extraction by converting text into numerical features.
<hr>

**Term Frequency** (TF) measures local importance in a document $d$, and it is usually defined by: 
<br>$\large \mathrm{TF}(t,d)=\frac{\text{number of times term } t \text{ appears in } d}{\text{total number of terms in } d}$
<br>
<br>**Inverse Document Frequency** (IDF) measures global rarity across $N$ documents, and may be defined by: 
<br>$\large \mathrm{IDF}(t)=\log\left(\frac{N}{\text{number of documents containing } t}\right)$ 
<br>
<br>**TF-IDF Score** is defined by: 
<br>$\large \text{TF-IDF}(t,d)=\mathrm{TF}(t,d)\times \mathrm{IDF}(t)$
- Higher TF-IDF → more discriminative term for that document. 
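
The three definitions above can be traced by hand on a very small corpus. The following is a minimal sketch, assuming a hypothetical, already tokenized toy corpus (it is not the example corpus used later) and the plain, unsmoothed IDF with the natural logarithm. The helper names `tf`, `idf`, and `tf_idf` are illustrative only.

```python
import math

# Hypothetical toy corpus, already tokenized (an assumption for illustration)
corpus = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["apple", "cherry", "cherry", "date"],
]

def tf(term, doc):
    """Share of the tokens in `doc` that are equal to `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Natural log of N divided by the number of documents containing `term`."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    """Product of the local (TF) and global (IDF) weights."""
    return tf(term, doc) * idf(term, docs)

print(round(tf("apple", corpus[0]), 4))             # 2 of the 3 tokens are "apple"
print(round(idf("apple", corpus), 4))               # log(3 / 2): "apple" is in 2 of 3 documents
print(round(idf("date", corpus), 4))                # log(3 / 1): "date" is in 1 of 3 documents
print(round(tf_idf("date", corpus[2], corpus), 4))  # (1 / 4) * log(3)
```

In this sketch, `idf` raises `ZeroDivisionError` for a term that appears in no document, which is one reason smoothed variants of the formula are common in practice.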
     

 
<hr>

In the following, we give **Python** code to compute TF-IDF from scratch for a corpus (a collection of documents). Documents should be **preprocessed** before computing TF-IDF, for example by removing punctuation, tokenizing, and removing stopwords. Each document in the corpus is then represented by a feature vector of numerical values, where each component is the TF-IDF score of the corresponding vocabulary term in that document. The implementation adds 1 to the document count inside the logarithm of the IDF as a smoothing step, so its scores differ slightly from the plain formula. 
- The terms with high scores in a document may be considered keywords for that document. 
- An example with a few simple documents clarifies the concept.

As a **bonus**, we give some Python code based on **scikit-learn**, which shows how to use **TF-IDF** with the class **TfidfVectorizer**.

<hr>

https://github.com/ostad-ai/Machine-Learning
<br> Explanation: https://www.pinterest.com/HamedShahHosseini/Machine-Learning/

In [1]:
# Import required modules
import math
from collections import Counter
import re

In [2]:
# Define a basic set of English stopwords
STOPWORDS = {
    'a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and',
    'any', 'are', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being',
    'below', 'between', 'both', 'but', 'by', "can't", 'cannot', 'could', "couldn't",
    'did', "didn't", 'do', 'does', "doesn't", 'doing', "don't", 'down', 'during',
    'each', 'few', 'for', 'from', 'further', 'had', "hadn't", 'has', "hasn't",
    'have', "haven't", 'having', 'he', "he'd", "he'll", "he's", 'her', 'here',
    "here's", 'hers', 'herself', 'him', 'himself', 'his', 'how', "how's", 'i',
    "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', "isn't", 'it', "it's",
    'its', 'itself', "let's", 'me', 'more', 'most', "mustn't", 'my', 'myself',
    'no', 'nor', 'not', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'ought',
    'our', 'ours', 'ourselves', 'out', 'over', 'own', 'same', "shan't", 'she',
    "she'd", "she'll", "she's", 'should', "shouldn't", 'so', 'some', 'such', 'than',
    'that', "that's", 'the', 'their', 'theirs', 'them', 'themselves', 'then',
    'there', "there's", 'these', 'they', "they'd", "they'll", "they're", "they've",
    'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 'very', 'was',
    "wasn't", 'we', "we'd", "we'll", "we're", "we've", 'were', "weren't", 'what',
    "what's", 'when', "when's", 'where', "where's", 'which', 'while', 'who',
    "who's", 'whom', 'why', "why's", 'with', "won't", 'would', "wouldn't", 'you',
    "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves'
}

# # Add domain-specific stopwords
# STOPWORDS.update({'http', 'www', 'com', 'said', 'mr', 'ms'})

def preprocess(text):
    """
    Convert to lowercase, remove punctuation, tokenize, and remove stopwords.
    """
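    # Keep only letters, apostrophes, and whitespace; apostrophes are kept so that
    # contractions such as "don't" can match their entries in STOPWORDS
    text = re.sub(r"[^a-z'\s]", ' ', text.lower())
    # Split into tokens and strip stray quote marks around words
    tokens = [token.strip("'") for token in text.split()]
    # Remove empty tokens and stopwords
    tokens = [token for token in tokens if token and token not in STOPWORDS]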
    return tokens

In [3]:
def compute_tf_idf(documents):
    """
    Compute TF-IDF from a list of document strings.
    
    Returns:
        tfidf_matrix: list of dicts {term: tfidf_score}
        vocabulary: sorted list of all terms
    """
    # Preprocess all docs
    processed_docs = [preprocess(doc) for doc in documents]
    N = len(processed_docs)
    
    # Build vocabulary
    vocabulary = sorted(set(term for doc in processed_docs for term in doc))
    
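    # Compute TF for each doc
    tf = []
    for doc in processed_docs:
        term_count = Counter(doc)
        # Guard against documents left empty by preprocessing (avoids division by zero)
        total_terms = len(doc) or 1
        tf_doc = {term: term_count.get(term, 0) / total_terms for term in vocabulary}
        tf.append(tf_doc)
    
    # Compute IDF for each term
    doc_term_sets = [set(doc) for doc in processed_docs]  # sets give fast membership tests
    idf = {}
    for term in vocabulary:
        doc_freq = sum(1 for terms in doc_term_sets if term in terms)
        # +1 for smoothing; note the IDF becomes 0 or negative when a term
        # occurs in N - 1 or in all N documents
        idf[term] = math.log(N / (1 + doc_freq))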
    
    # Compute TF-IDF
    tfidf_matrix = []
    for tf_doc in tf:
        tfidf_doc = {term: tf_doc[term] * idf[term] for term in vocabulary}
        tfidf_matrix.append(tfidf_doc)
    
    return tfidf_matrix, vocabulary
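
The IDF line in `compute_tf_idf` uses `math.log(N / (1 + doc_freq))`, which is not the same as the plain formula $\log(N/\text{df})$ given earlier. The sketch below compares the two, together with a third variant, $\log\frac{1+N}{1+\text{df}}+1$, which is the smoothed form that the scikit-learn documentation describes as its default. The function names are hypothetical and chosen only for this comparison.

```python
import math

def idf_plain(n_docs, doc_freq):
    """log(N / df): the definition given in the text."""
    return math.log(n_docs / doc_freq)

def idf_notebook(n_docs, doc_freq):
    """log(N / (1 + df)): the variant used in compute_tf_idf."""
    return math.log(n_docs / (1 + doc_freq))

def idf_smooth_plus_one(n_docs, doc_freq):
    """log((1 + N) / (1 + df)) + 1: a smoothed variant that never drops below 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 6  # same corpus size as the example in this notebook
print("df  plain  notebook  smooth+1")
for doc_freq in (1, 3, 5, 6):
    print(doc_freq,
          round(idf_plain(n_docs, doc_freq), 4),
          round(idf_notebook(n_docs, doc_freq), 4),
          round(idf_smooth_plus_one(n_docs, doc_freq), 4))
```

With the notebook's variant, a term that occurs in $N-1$ documents gets an IDF of exactly zero, and a term that occurs in every document gets a slightly negative one. This is harmless for the six example documents, where no term occurs in more than three of them, but it is worth keeping in mind for corpora with very common terms that are not in the stopword list.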

In [4]:
# -------------------------------
# Example Usage
# -------------------------------
# Sample document collection
docs = [
    "The cat sat on the mat and purred softly",
    "The dog chased the cat through the garden",
    "Birds fly high in the clear blue sky",
    "Machine learning is a subset of artificial intelligence",
    "Natural language processing helps computers understand text",
    "The cat and dog are common household pets"
]
tfidf_matrix, vocab = compute_tf_idf(docs)

print("Vocabulary:", vocab)
print("\nTF-IDF Scores (non-zero only):")
for i, doc_tfidf in enumerate(tfidf_matrix):
    print(f"\nDocument {i+1}:")
    for term in vocab:
        score = doc_tfidf[term]
        if score != 0:  # matches the "non-zero only" label; smoothed IDF can be negative
            print(f"  {term}: {score:.4f}")

Vocabulary: ['artificial', 'birds', 'blue', 'cat', 'chased', 'clear', 'common', 'computers', 'dog', 'fly', 'garden', 'helps', 'high', 'household', 'intelligence', 'language', 'learning', 'machine', 'mat', 'natural', 'pets', 'processing', 'purred', 'sat', 'sky', 'softly', 'subset', 'text', 'understand']

TF-IDF Scores (non-zero only):

Document 1:
  cat: 0.0811
  mat: 0.2197
  purred: 0.2197
  sat: 0.2197
  softly: 0.2197

Document 2:
  cat: 0.1014
  chased: 0.2747
  dog: 0.1733
  garden: 0.2747

Document 3:
  birds: 0.1831
  blue: 0.1831
  clear: 0.1831
  fly: 0.1831
  high: 0.1831
  sky: 0.1831

Document 4:
  artificial: 0.2197
  intelligence: 0.2197
  learning: 0.2197
  machine: 0.2197
  subset: 0.2197

Document 5:
  computers: 0.1569
  helps: 0.1569
  language: 0.1569
  natural: 0.1569
  processing: 0.1569
  text: 0.1569
  understand: 0.1569

Document 6:
  cat: 0.0811
  common: 0.2197
  dog: 0.1386
  household: 0.2197
  pets: 0.2197
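
Once every document is a vector of TF-IDF scores, document similarity can be measured with the cosine of the angle between two vectors. The following is a minimal sketch, assuming the sparse `{term: score}` representation used above. The function `cosine_similarity` is a hypothetical helper written for this illustration, and the three dictionaries simply copy the rounded non-zero scores printed for Documents 1, 2, and 3.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as {term: score} dicts."""
    dot = sum(score * vec_b.get(term, 0.0) for term, score in vec_a.items())
    norm_a = math.sqrt(sum(s * s for s in vec_a.values()))
    norm_b = math.sqrt(sum(s * s for s in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction; treat it as not similar
    return dot / (norm_a * norm_b)

# Non-zero scores copied from the printed output for Documents 1, 2, and 3
doc1 = {"cat": 0.0811, "mat": 0.2197, "purred": 0.2197, "sat": 0.2197, "softly": 0.2197}
doc2 = {"cat": 0.1014, "chased": 0.2747, "dog": 0.1733, "garden": 0.2747}
doc3 = {"birds": 0.1831, "blue": 0.1831, "clear": 0.1831,
        "fly": 0.1831, "high": 0.1831, "sky": 0.1831}

print(cosine_similarity(doc1, doc2))  # small but positive: only "cat" is shared
print(cosine_similarity(doc1, doc3))  # 0.0: the documents share no terms
```

Because "cat" occurs in half of the documents, its low IDF keeps the similarity between Documents 1 and 2 small even though both mention it.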


<hr style="height:3px; background-color:lightblue">

# Bonus
### Using scikit-learn for TF-IDF

In [5]:
# Bonus, using scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The cat sat on the mat and purred softly",
    "The dog chased the cat through the garden",
    "Birds fly high in the clear blue sky",
    "Machine learning is a subset of artificial intelligence",
    "Natural language processing helps computers understand text",
    "The cat and dog are common household pets"
]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(docs)

print("Feature names:", vectorizer.get_feature_names_out())
print("TF-IDF matrix:\n", tfidf_matrix.toarray())

Feature names: ['artificial' 'birds' 'blue' 'cat' 'chased' 'clear' 'common' 'computers'
 'dog' 'fly' 'garden' 'helps' 'high' 'household' 'intelligence' 'language'
 'learning' 'machine' 'mat' 'natural' 'pets' 'processing' 'purred' 'sat'
 'sky' 'softly' 'subset' 'text' 'understand']
TF-IDF matrix:
 [[0.         0.         0.         0.32711256 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.47249269 0.         0.         0.         0.47249269 0.47249269
  0.         0.47249269 0.         0.         0.        ]
 [0.         0.         0.         0.38996741 0.56328241 0.
  0.         0.         0.46189963 0.         0.56328241 0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.        ]
 [0.         0.40824829 0.40824829 0.         0.         0.40824829
  0.         0.         0.    
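
The scikit-learn values differ from the from-scratch scores because the weighting scheme is different. According to the scikit-learn documentation, `TfidfVectorizer` by default uses raw term counts for TF, the smoothed IDF $\ln\frac{1+N}{1+\text{df}}+1$, and then scales each row to unit Euclidean (L2) length. The sketch below recomputes the first row in plain Python under those assumptions. The counts and document frequencies are read off the example corpus, and the variable names are illustrative.

```python
import math

# Terms of Document 1 after stopword removal, with their counts in the document
# and their document frequencies in the six-document corpus
term_counts = {"cat": 1, "mat": 1, "purred": 1, "sat": 1, "softly": 1}
doc_freq = {"cat": 3, "mat": 1, "purred": 1, "sat": 1, "softly": 1}
n_docs = 6

# Raw count times smoothed IDF: ln((1 + N) / (1 + df)) + 1
raw = {
    term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
    for term, count in term_counts.items()
}

# L2-normalize so that the row has unit length
norm = math.sqrt(sum(value * value for value in raw.values()))
weights = {term: value / norm for term, value in raw.items()}

for term, weight in weights.items():
    print(f"{term}: {weight:.8f}")
```

The results agree with the first row of the matrix printed above: about 0.32711256 for "cat" and 0.47249269 for each of the other four terms.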

In [6]:
def extract_top_keywords(documents, vectorizer, n_keywords=3):
    """Extract top keywords from documents using TF-IDF"""
    tfidf_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()
    
    for i, doc in enumerate(documents):
        # Get TF-IDF scores for this document
        feature_index = tfidf_matrix[i,:].nonzero()[1]
        tfidf_scores = zip(feature_index, [tfidf_matrix[i, x] for x in feature_index])
        
        # Sort by TF-IDF score
        sorted_scores = sorted(tfidf_scores, key=lambda x: x[1], reverse=True)
        
        # Get top keywords
        top_keywords = [feature_names[idx] for idx, score in sorted_scores[:n_keywords]]
        print(f"Document {i+1} top keywords: {top_keywords}")

print("=== KEYWORD EXTRACTION ===")
extract_top_keywords(docs, TfidfVectorizer(stop_words='english'))

=== KEYWORD EXTRACTION ===
Document 1 top keywords: ['softly', 'purred', 'mat']
Document 2 top keywords: ['garden', 'chased', 'dog']
Document 3 top keywords: ['sky', 'blue', 'clear']
Document 4 top keywords: ['intelligence', 'artificial', 'subset']
Document 5 top keywords: ['text', 'understand', 'computers']
Document 6 top keywords: ['pets', 'household', 'common']
