<a href="https://colab.research.google.com/github/pranalibose/LangVisionWorkshop/blob/main/HO1_Text_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Text Preprocessing Steps**

Before feeding text data into a machine learning model, it must be cleaned and standardized. Below are some key **text preprocessing steps**:

### **1. Stopwords Removal**
- Stopwords are common words (e.g., "the", "is", "and") that do not contribute much meaning to text analysis.
- Removing stopwords helps reduce dimensionality and improves computational efficiency.
- Example using NLTK in Python:
  ```python
  from nltk.corpus import stopwords
  from nltk.tokenize import word_tokenize
  import nltk
  nltk.download('stopwords')
  nltk.download('punkt')
  nltk.download('punkt_tab')  # newer NLTK releases load the tokenizer data from this package
  
  text = "This is an example of text preprocessing in NLP."
  stop_words = set(stopwords.words("english"))
  words = word_tokenize(text)
  filtered_text = [word for word in words if word.lower() not in stop_words]
  print(filtered_text)  # ['example', 'text', 'preprocessing', 'NLP', '.']
  ```

### **2. Tokenization**
- Tokenization splits text into meaningful units called **tokens**.
- Types of tokenization:
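  - **Word-based**: Splits on whitespace and, usually, punctuation (e.g., "Machine Learning" → ["Machine", "Learning"]).
  - **Subword-based**: Breaks rare words into smaller, frequently occurring pieces using methods such as WordPiece, Byte-Pair Encoding (BPE), or SentencePiece.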
  - **Character-based**: Splits into individual characters (useful for languages without spaces).

- Example using NLTK:
  ```python
  from nltk.tokenize import word_tokenize

  words = word_tokenize("Tokenization splits text into words.")
  print(words)  # ['Tokenization', 'splits', 'text', 'into', 'words', '.']
  ```
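
The three granularities can be compared with a short sketch in plain Python. The helper `bpe_merge_once` is a hypothetical name introduced here, and it performs only a single BPE-style merge of the most frequent adjacent pair. Trained subword tokenizers repeat such merges many times over a large corpus, so treat this as an illustration of the idea rather than a real tokenizer.

```python
from collections import Counter

text = "low lower lowest"

# Word-based: split on whitespace.
word_tokens = text.split()

# Character-based: every character of a word becomes a token.
char_tokens = list("lower")


def bpe_merge_once(words):
    """Merge the most frequent adjacent symbol pair once (illustrative helper)."""
    symbol_seqs = [list(word) for word in words]
    pair_counts = Counter(
        (seq[i], seq[i + 1]) for seq in symbol_seqs for i in range(len(seq) - 1)
    )
    # On ties, max() keeps the pair that was counted first.
    best = max(pair_counts, key=pair_counts.get)

    merged = []
    for seq in symbol_seqs:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(seq[i] + seq[i + 1])  # fuse the pair into one symbol
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return best, merged


best_pair, subword_tokens = bpe_merge_once(word_tokens)
print(word_tokens)     # ['low', 'lower', 'lowest']
print(char_tokens)     # ['l', 'o', 'w', 'e', 'r']
print(best_pair)       # ('l', 'o')
print(subword_tokens)  # [['lo', 'w'], ['lo', 'w', 'e', 'r'], ['lo', 'w', 'e', 's', 't']]
```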


## **What are Embeddings?**
- Embeddings are **numerical representations of words or documents** in a vector space.
- They capture **semantic relationships** between words.
- Example: Word vectors for **"king"**, **"man"**, **"woman"**, and **"queen"** may capture the analogy:
  
  \[ \text{King} - \text{Man} + \text{Woman} \approx \text{Queen} \]
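
The analogy can be made concrete with a toy example. The 2-D vectors below are hand-picked, hypothetical values (one axis loosely standing for "royalty", the other for "gender"). Real embeddings are learned from data and have many more dimensions, so this sketch only shows the vector arithmetic and the cosine-similarity lookup.

```python
import math

# Hypothetical, hand-made vectors: [royalty, gender]. Not learned from data.
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# king - man + woman, computed component by component.
target = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]

# Look for the closest word, excluding the three words used in the query.
candidates = {
    word: vec for word, vec in vectors.items() if word not in ("king", "man", "woman")
}
nearest = max(candidates, key=lambda word: cosine(target, candidates[word]))
print(nearest)  # queen
```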

### **Types of Embeddings**

### **1. TF-IDF (Term Frequency-Inverse Document Frequency)**
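- Weights each word by how often it appears in a document, discounted by how many documents in the collection contain it. Each document becomes a sparse vector with one dimension per vocabulary word.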
- Formula:
  \[
  TF(w) = \frac{\text{Number of times word w appears in document}}{\text{Total words in document}}
  \]
  \[
  IDF(w) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing w}}\right)
  \]
  \[
  \text{TF-IDF}(w) = \text{TF}(w) \times \text{IDF}(w)
  \]

- Example using Scikit-Learn:
  ```python
  from sklearn.feature_extraction.text import TfidfVectorizer
  corpus = ["This is a sample document.", "This document is about NLP."]
  vectorizer = TfidfVectorizer()
  tfidf_matrix = vectorizer.fit_transform(corpus)
  print(tfidf_matrix.toarray())
  ```
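
To connect the formulas to numbers, the sketch below computes TF-IDF by hand in plain Python, following the three equations above literally. The helper names `tf`, `idf`, and `tf_idf` are introduced here for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its output will not match these values exactly.

```python
import math

corpus = [
    "this is a sample document",
    "this document is about nlp",
]
docs = [doc.split() for doc in corpus]


def tf(word, doc):
    """Occurrences of `word` in the document divided by the document length."""
    return doc.count(word) / len(doc)


def idf(word, docs):
    """log(total documents / documents containing `word`).

    Assumes the word occurs in at least one document.
    """
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)


def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)


# "sample" occurs only in the first document, so it gets a positive weight.
print(tf_idf("sample", docs[0], docs))    # (1/5) * ln(2/1)

# "document" occurs in every document, so its IDF (and weight) is zero.
print(tf_idf("document", docs[0], docs))  # (1/5) * ln(2/2) = 0.0
```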

### **2. Word2Vec**
- Trains a shallow **neural network** to learn a dense vector for each word from the contexts in which the word appears.
- Two approaches:
  - **CBOW (Continuous Bag of Words)**: Predicts a target word from surrounding words.
    \[
    P(w_t | w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2})
    \]
  - **Skip-Gram**: Predicts surrounding words given a target word.
    \[
    P(w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} | w_t)
    \]

- Example using Gensim:
  ```python
  from gensim.models import Word2Vec
  sentences = [["machine", "learning", "is", "fun"], ["deep", "learning", "is", "awesome"]]
  model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
  print(model.wv["learning"])  # Word vector for 'learning'
  ```
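
The difference between the two objectives is easiest to see in the training examples each one produces. The sketch below builds them in plain Python for a window of 2, matching the formulas above. The function names are hypothetical, and this is not how Gensim constructs its batches internally; it only illustrates which words play the roles of input and prediction target.

```python
def context_window(tokens, t, window=2):
    """Words within `window` positions of index t, excluding the word at t."""
    left = tokens[max(0, t - window):t]
    right = tokens[t + 1:t + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: (context words, target word) for every position."""
    return [(context_window(tokens, t, window), tokens[t]) for t in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """Skip-Gram: one (target word, context word) pair per context word."""
    return [
        (tokens[t], context)
        for t in range(len(tokens))
        for context in context_window(tokens, t, window)
    ]


tokens = ["deep", "learning", "is", "awesome"]
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)

print(cbow[1])        # (['deep', 'is', 'awesome'], 'learning')
print(skipgram[:3])   # [('deep', 'learning'), ('deep', 'is'), ('learning', 'deep')]
```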


## **Summary of Embeddings**
| Embedding Type | How It Works | Strengths | Weaknesses |
|---------------|-------------|-----------|------------|
| **TF-IDF** | Uses frequency and importance in documents | Simple and effective for keyword-based tasks | Ignores word meaning and order |
| **Word2Vec** | Learns vector representations using neural networks | Captures semantic relationships | Requires a large corpus for good results |



In [None]:
#!pip install nltk spacy numpy gensim scikit-learn

In [None]:
import re
import nltk
import string
import spacy
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by word_tokenize/sent_tokenize in newer NLTK releases
nltk.download('stopwords')

# Load SpaCy NLP model
nlp = spacy.load("en_core_web_sm")

# Define stopwords
stop_words = set(stopwords.words("english"))

# Sample Resume Text (Replace this with real input)
resume_text = """
John Doe
Software Engineer

Education: B.Sc. in Computer Science, XYZ University
Experience: 3 years in software development
Skills: Python, Machine Learning, NLP, TensorFlow
"""

# Function to preprocess text
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # Remove punctuation
    tokens = word_tokenize(text)  # Tokenize words
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return " ".join(tokens)

# Function to extract sections
def extract_sections(text):
    sections = {}
    patterns = {
        "education": r"education[:\-\n](.*?)(?=experience|skills|$)",
        "experience": r"experience[:\-\n](.*?)(?=education|skills|$)",
        "skills": r"skills[:\-\n](.*?)(?=education|experience|$)"
    }

    for key, pattern in patterns.items():
        match = re.search(pattern, text, re.IGNORECASE | re.DOTALL)
        if match:
            sections[key] = preprocess_text(match.group(1))
        else:
            sections[key] = ""
    return sections

# Preprocess resume text
clean_text = preprocess_text(resume_text)
sections = extract_sections(resume_text)

print("Cleaned Resume Text:", clean_text)
print("Extracted Sections:", sections)

# Tokenization and Embeddings

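## TF-IDF Vectorization
# Note: when fitted on a single document, every term receives the same IDF, so the
# weights reflect term frequency only. Fit on a collection of resumes for useful IDF values.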
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform([clean_text])
print("TF-IDF Embedding Shape:", tfidf_matrix.shape)

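## Word2Vec Embeddings
# Note: a single short resume is far too little text to learn meaningful vectors;
# this only demonstrates the API. Train on a large corpus for real use.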
def generate_word2vec_embeddings(text):
    sentences = [word_tokenize(sent) for sent in sent_tokenize(text)]
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
    word_vectors = {word: model.wv[word] for word in model.wv.index_to_key}
    return model, word_vectors

word2vec_model, word_vectors = generate_word2vec_embeddings(clean_text)
print("Word2Vec Vocabulary Size:", len(word_vectors))

# Example usage of embeddings
example_word = "python"
if example_word in word_vectors:
    print(f"Embedding for '{example_word}':", word_vectors[example_word])
else:
    print(f"'{example_word}' not found in vocabulary")


[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Cleaned Resume Text: john doe software engineer education bsc computer science xyz university experience 3 years software development skills python machine learning nlp tensorflow
Extracted Sections: {'education': 'bsc computer science xyz university', 'experience': '3 years software development', 'skills': 'python machine learning nlp tensorflow'}
TF-IDF Embedding Shape: (1, 19)
Word2Vec Vocabulary Size: 20
Embedding for 'python': [ 1.2923904e-03 -9.7982762e-03  4.5899418e-03 -5.3133053e-04
  6.3345684e-03  1.7845632e-03 -3.1254247e-03  7.7570681e-03
  1.5504272e-03  5.8964350e-05 -4.6103387e-03 -8.4572500e-03
 -7.7618901e-03  8.6681144e-03 -8.9204954e-03  9.0352260e-03
 -9.2851855e-03 -2.7858073e-04 -1.9112782e-03 -8.9328075e-03
  8.6294059e-03  6.

In [None]:
def extract_resume_sections(resume_text):
    """
    Extracts the Summary, Education, Experience, and Skills sections from resume text.
    """
    sections = {}

    # Header names for each section (customize as needed). A header must occupy a
    # whole line, optionally ending in a colon, so that a word such as "Skills"
    # inside a sentence is not mistaken for a section header.
    header_names = {
        "education": r"Education|Qualifications|Academic Details",
        "experience": r"Experience|Professional Experience|Work History",
        "skills": r"Skills|Technical Skills|Core Competencies|Areas of Expertise",
        "summary": r"Summary|Objective|Career Objective|Professional Summary",
    }

    # Locate the first header line of each section
    start_indices = {}
    for section_name, names in header_names.items():
        pattern = rf"^[ \t]*(?:{names})[ \t]*:?[ \t]*$"
        match = re.search(pattern, resume_text, re.IGNORECASE | re.MULTILINE)
        if match:
            start_indices[section_name] = match.start()
        else:
            print(f"Section '{section_name}' not found.")

    # Each section ends where the next header starts (or at the end of the text)
    for section_name, start in start_indices.items():
        later_starts = [s for s in start_indices.values() if s > start]
        end = min(later_starts, default=len(resume_text))
        sections[section_name] = resume_text[start:end].strip()

    return sections

resume_text = """
Summary
A highly motivated and results-oriented professional looking to leverage my strong Machine Learning Skills.

Education
Bachelor of Science in Computer Science, University X, 2020-2024
Master of Technology in Data Science, University Y, 2024-2026

Experience
Software Engineer Intern, Company A, May 2023 - Aug 2023
Data Scientist, Company B, Jan 2024 - Present

Skills
Python, Java, Machine Learning, Deep Learning, NLP, Communication, Teamwork
"""

extracted_sections = extract_resume_sections(resume_text)

for section_name, content in extracted_sections.items():
    print(f"\n{section_name.upper()}")
    print(content)


EDUCATION
Education
Bachelor of Science in Computer Science, University X, 2020-2024
Master of Technology in Data Science, University Y, 2024-2026

EXPERIENCE
Experience
Software Engineer Intern, Company A, May 2023 - Aug 2023
Data Scientist, Company B, Jan 2024 - Present

SKILLS
Skills
Python, Java, Machine Learning, Deep Learning, NLP, Communication, Teamwork

SUMMARY
Summary
A highly motivated and results-oriented professional looking to leverage my strong Machine Learning Skills.
