<a href="https://colab.research.google.com/github/Sakinat-Folorunso/OOU_CSC309_Artificial_Intelligence/blob/main/notebooks/CSC309_Week08_NLP2_Sentiment_Spelling_MT_Student_Centred.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CSC309 – Artificial Intelligence  
**Week 8 Lab:** NLP II: Sentiment, Spelling, (Optional) Translation & Speech

**Instructor:** Dr Sakinat Folorunso  

**Title:** Associate Professor of AI Systems and FAIR Data

**Department:** Computer Sciences, Olabisi Onabanjo University, Ago-Iwoye, Ogun State, Nigeria

**Course Code**: CSC 309

**Mode:** Student-centred, hands-on in Google Colab

> Every code cell is commented line-by-line so you can follow the logic precisely.

## How to use this notebook
1. Start with the **Group Log** and **Do Now**.  
2. Run the **Setup** cell once.  
3. Work through **Tasks**. Edit only cells marked **`# TODO(Student)`**.  
4. Use **Quick Checks** to test your understanding.  
5. Finish with the **Reflection**. If you finish early, try the **Extensions**.

In [None]:
#@title 🧑🏽‍🤝‍🧑🏾 Group Log (fill before you start)
# The '#@param' annotations create form fields in Colab for easy input.

group_members = "Type names here"  #@param {type:"string"}  # Names of teammates
roles_notes = "Driver/Navigator, decisions, questions"  #@param {type:"string"}  # Short working notes

print("👥 Group:", group_members)        # Echo the group list for confirmation
print("📝 Notes:", roles_notes)          # Echo the notes so they're preserved in output

### Learning Objectives
- Compare **rule-based** and **machine-learning (ML)** sentiment analysis.  
- Implement **spelling correction**; optionally try machine translation (MT) or speech.
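
To make the first objective concrete, here is a minimal sketch of what "rule-based" means: a fixed word list with scores, plus one hand-written rule for negation. The names `TOY_LEXICON`, `NEGATIONS`, and `rule_based_score` are hypothetical and invented for illustration; VADER's real lexicon and rules are much richer than this.

```python
# Toy rule-based sentiment scorer (illustrative only; not VADER's actual rules).
TOY_LEXICON = {"love": 2.0, "great": 1.5, "good": 1.0,
               "bad": -1.0, "boring": -1.5, "hate": -2.0}    # Hypothetical word scores
NEGATIONS = {"not", "never", "no"}                            # Hypothetical negation cues

def rule_based_score(text):
    """Sum lexicon scores, flipping the sign of a word that follows a negation."""
    cleaned = text.lower().replace("!", " ").replace(".", " ").replace(",", " ")
    tokens = cleaned.split()                                  # Simple whitespace tokens
    total = 0.0
    for i, tok in enumerate(tokens):
        score = TOY_LEXICON.get(tok, 0.0)                     # Unknown words score 0
        if i > 0 and tokens[i - 1] in NEGATIONS:              # Rule: negation flips polarity
            score = -score
        total += score
    return total

print(rule_based_score("I love this course!"))        # 2.0
print(rule_based_score("The lecture was not good."))  # -1.0
```

No training data is involved: every decision comes from the word list and the negation rule. An ML model instead learns its word weights from labelled examples.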

In [None]:
#@title 🔧 Setup
# This lab uses NLTK for sentiment resources and scikit-learn for a small ML model.

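import importlib, subprocess, sys                                   # For optional installs

def pip_install(pkgs):
    """Install each pip package whose import module is missing."""
    for pip_name, module_name in pkgs.items():
        try:
            importlib.import_module(module_name)                    # Try importing the module
        except ImportError:
            subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pip_name])  # Install if missing

# Map pip name -> import name (they differ for scikit-learn and pyspellchecker)
pip_install({"nltk": "nltk", "scikit-learn": "sklearn", "pyspellchecker": "spellchecker"})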

import nltk                                                         # NLP toolkit (tokenizers, corpora)
try: nltk.data.find('sentiment/vader_lexicon')                      # Try to find VADER lexicon locally
except LookupError: nltk.download('vader_lexicon')                  # If missing, download it
try: nltk.data.find('corpora/movie_reviews')                        # Movie reviews for ML demo
except LookupError: nltk.download('movie_reviews')
try: nltk.data.find('tokenizers/punkt')                             # Tokenizer (may be needed)
except LookupError: nltk.download('punkt')

print("✅ Setup complete for Week 8.")

In [None]:
#@title 🙂 Sentiment: VADER (rule-based) and LinearSVC (ML), fully commented
from nltk.sentiment import SentimentIntensityAnalyzer               # VADER sentiment analyzer

sia = SentimentIntensityAnalyzer()                                  # Create a VADER instance
print("VADER demo:", sia.polarity_scores("I absolutely love this course!"))  # Show a quick polarity dict

# --- ML sentiment on movie_reviews -----------------------------------------
from nltk.corpus import movie_reviews                               # Access labeled movie review data
from sklearn.feature_extraction.text import TfidfVectorizer         # Turn text into TF-IDF features
from sklearn.svm import LinearSVC                                   # Linear SVM classifier
from sklearn.model_selection import train_test_split                # Split into train/test

# Join tokens into raw text strings for each review
docs = [" ".join(movie_reviews.words(fid)) for fid in movie_reviews.fileids()]  # List of review texts
labels = [movie_reviews.categories(fid)[0] for fid in movie_reviews.fileids()]  # 'pos' or 'neg' labels

Xtr_raw, Xte_raw, ytr, yte = train_test_split(docs, labels, test_size=0.2, random_state=42)  # Train/test split
vec = TfidfVectorizer(min_df=3)                                       # Ignore very rare terms
Xtr = vec.fit_transform(Xtr_raw)                                      # Fit vectorizer on training text
Xte = vec.transform(Xte_raw)                                          # Transform test text with same vocab
clf = LinearSVC()                                                     # Instantiate a linear SVM
clf.fit(Xtr, ytr)                                                     # Train the classifier
print("Movie review accuracy:", clf.score(Xte, yte))                  # Report accuracy
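
VADER returns a dictionary of scores (`neg`, `neu`, `pos`, `compound`), while the LinearSVC predicts a `'pos'` or `'neg'` label, so comparing the two needs a rule that turns the `compound` score into a label. The sketch below uses a cut-off of ±0.05, which is the convention commonly suggested for VADER; treat both the threshold and the helper name `label_from_compound` as assumptions, not part of NLTK.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER-style compound score in [-1, 1] to 'pos', 'neg', or 'neu'."""
    if compound >= threshold:          # Clearly positive
        return "pos"
    if compound <= -threshold:         # Clearly negative
        return "neg"
    return "neu"                       # Too close to zero to call

# Hand-written example scores (not real VADER output)
for c in (0.72, -0.40, 0.01):
    print(c, "->", label_from_compound(c))
```

In the lab you could pass `sia.polarity_scores(text)["compound"]` to this helper and count how often its label agrees with the `movie_reviews` label, keeping in mind that the corpus has no neutral class.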

In [None]:
#@title ✅ Spelling correction (pyspellchecker), fully commented
from spellchecker import SpellChecker                                 # Import the SpellChecker class
spell = SpellChecker()                                                # Create a spell checker instance

sentence = "Ths is a sentnce with som misspelled wrds"               # Example sentence with typos
tokens = sentence.split()                                            # Tokenize by simple whitespace split
misspelled = spell.unknown(tokens)                                   # Identify words not in dictionary
corrections = {w: spell.correction(w) or w for w in misspelled}      # Best guess per word; keep the word if no guess is returned
print("Original:", sentence)                                         # Show the original sentence
print("Corrections:", corrections)                                   # Show the suggested corrections
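
To see roughly how a dictionary-based corrector can pick its "best guess", the sketch below generates every string one edit away from a word and keeps the most frequent one found in a vocabulary. It is a simplified, edit-distance-1 version in the style of Peter Norvig's well-known spelling corrector, not pyspellchecker's actual code; `TOY_FREQ`, `edits1`, and `toy_correction` are hypothetical names, and the frequencies are made up.

```python
import string

# Hypothetical word frequencies; a real checker loads these from a large corpus.
TOY_FREQ = {"this": 50, "is": 60, "a": 80, "sentence": 10,
            "with": 40, "some": 30, "misspelled": 2, "words": 15}

def edits1(word):
    """Return every string one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def toy_correction(word):
    """Return the word itself if known, else the most frequent known word one edit away."""
    word = word.lower()
    if word in TOY_FREQ:
        return word
    candidates = [w for w in edits1(word) if w in TOY_FREQ]   # Keep only real words
    return max(candidates, key=TOY_FREQ.get) if candidates else word

print([toy_correction(w) for w in "Ths is a sentnce with som misspelled wrds".split()])
```

With this toy vocabulary each typo has exactly one known neighbour, so the choice is unambiguous. With a full dictionary several candidates usually compete, which is why word frequency matters.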

In [None]:
#@title (Optional) Translation with MarianMT (large download in Colab)
# !pip -q install transformers sentencepiece                         # Install only if you want to try
# from transformers import MarianMTModel, MarianTokenizer            # Import translation model/tokenizer
# model_name = "Helsinki-NLP/opus-mt-en-fr"                          # Choose an English→French model
# tok = MarianTokenizer.from_pretrained(model_name)                  # Load tokenizer
# model = MarianMTModel.from_pretrained(model_name)                  # Load model
# src = ["Artificial Intelligence is exciting."]                     # Source sentence list
# out = model.generate(**tok(src, return_tensors="pt", padding=True))  # Generate translations
# print(tok.batch_decode(out, skip_special_tokens=True))             # Decode to strings