# Natural Language Processing (NLP) Assignment
This assignment will guide you through the basic concepts of Natural Language Processing including:
- Text preprocessing
- Tokenization and N-grams
- Named Entity Recognition (NER)
- Converting text into numbers (vectorization)
- Word embeddings (for experienced learners)

You can run and modify the code cells below to complete the tasks.

In [1]:
# Import required libraries
import nltk
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import numpy as np
import pandas as pd
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import ngrams

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Specter\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Specter\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\Specter\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\Specter\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Specter\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
nlp = spacy.load("en_core_web_sm")
import string

## 1. Text Preprocessing
Clean the following text by converting it to lowercase and removing punctuation and stop words.

In [10]:
nltk.download('punkt_tab', download_dir='C:/nltk_data')
nltk.download('stopwords', download_dir='C:/nltk_data')
nltk.data.path.append('C:/nltk_data')


[nltk_data] Downloading package punkt_tab to C:/nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.
[nltk_data] Downloading package stopwords to C:/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
# Sample text
text = "Natural Language Processing is a fascinating field. It combines linguistics and computer science!"

# Preprocessing function
def preprocess(text):
    # Lowercase the text
    text = text.lower()

    # Tokenize
    tokens = word_tokenize(text)

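    # Remove punctuation: drop tokens made up entirely of punctuation characters
    # (a plain substring test against string.punctuation would keep tokens such as "..." or "``")
    tokens = [word for word in tokens if not all(ch in string.punctuation for ch in word)]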

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    cleaned_tokens = [word for word in tokens if word not in stop_words]

    return cleaned_tokens

# Run preprocessing
cleaned_tokens = preprocess(text)
print("Cleaned Tokens:", cleaned_tokens)


Cleaned Tokens: ['natural', 'language', 'processing', 'fascinating', 'field', 'combines', 'linguistics', 'computer', 'science']
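
If the NLTK data downloads are unavailable, the same three steps (lowercase, strip punctuation, drop stop words) can be sketched with the standard library alone. This is only an illustration: the regex tokenizer and the tiny `TOY_STOP_WORDS` set are assumptions made for this example and are much cruder than `word_tokenize` and NLTK's English stop-word list.

```python
import re

# Hypothetical, deliberately tiny stop-word set (NLTK's English list is far longer).
TOY_STOP_WORDS = {"is", "a", "it", "and", "the", "of"}

def preprocess_simple(text, stop_words=TOY_STOP_WORDS):
    # Lowercase the text
    text = text.lower()
    # Keep runs of letters/digits; this drops punctuation as a side effect
    tokens = re.findall(r"[a-z0-9]+", text)
    # Remove stop words
    return [t for t in tokens if t not in stop_words]

sample = ("Natural Language Processing is a fascinating field. "
          "It combines linguistics and computer science!")
simple_tokens = preprocess_simple(sample)
print("Cleaned Tokens:", simple_tokens)
```

On this sample sentence the sketch yields the same nine tokens as the NLTK version, but on other inputs (contractions, hyphenated words, non-ASCII text) the two will likely differ.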


## 2. Tokenization and N-grams
Generate bigrams (2-grams) from the cleaned tokens.

In [4]:
# Generate bigrams from cleaned tokens
bigrams = list(ngrams(cleaned_tokens, 2))
print("Bigrams:", bigrams)

Bigrams: [('natural', 'language'), ('language', 'processing'), ('processing', 'fascinating'), ('fascinating', 'field'), ('field', 'combines'), ('combines', 'linguistics'), ('linguistics', 'computer'), ('computer', 'science')]
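
To see what an n-gram generator does under the hood, here is a minimal pure-Python sketch that slides a window of size `n` over a token list. The helper name `make_ngrams` is hypothetical and is not part of NLTK; it is only meant to mirror what `ngrams(tokens, n)` returns for simple cases.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n shifted copies of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['natural', 'language', 'processing', 'fascinating', 'field']

bigrams_simple = make_ngrams(tokens, 2)
trigrams_simple = make_ngrams(tokens, 3)

print("Bigrams:", bigrams_simple)
print("Trigrams:", trigrams_simple)
```

A list of `k` tokens produces `k - n + 1` n-grams, so the five tokens here give four bigrams and three trigrams.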


## 3. Named Entity Recognition (NER)
Use spaCy to perform NER on a new sentence.

In [5]:
# Example sentence
sentence = "Barack Obama was born in Hawaii and was elected president in 2008."
doc = nlp(sentence)
for ent in doc.ents:
    print(ent.text, ent.label_)

Barack Obama PERSON
Hawaii GPE
2008 DATE


## 4. Converting Text to Numbers
Use CountVectorizer and TfidfVectorizer to convert a list of sentences into numeric vectors.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample sentences
sentences = [
    "I love machine learning.",
    "Natural language processing is a part of AI.",
    "AI is the future."
]

In [7]:
# Function to apply CountVectorizer
def apply_count_vectorizer(sentences):
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(sentences)
    return X.toarray(), vectorizer.get_feature_names_out()


In [8]:
# Function to apply TfidfVectorizer
def apply_tfidf_vectorizer(sentences):
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(sentences)
    return X.toarray(), vectorizer.get_feature_names_out()


In [9]:
# Run both functions
count_matrix, count_features = apply_count_vectorizer(sentences)
tfidf_matrix, tfidf_features = apply_tfidf_vectorizer(sentences)

In [10]:
# Print results
print("Count Vectorizer Output:\n", count_matrix)
print("Count Vectorizer Features:", count_features)

Count Vectorizer Output:
 [[0 0 0 0 1 1 1 0 0 0 0 0]
 [1 0 1 1 0 0 0 1 1 1 1 0]
 [1 1 1 0 0 0 0 0 0 0 0 1]]
Count Vectorizer Features: ['ai' 'future' 'is' 'language' 'learning' 'love' 'machine' 'natural' 'of'
 'part' 'processing' 'the']


In [11]:
print("\nTF-IDF Vectorizer Output:\n", tfidf_matrix)
print("TF-IDF Vectorizer Features:", tfidf_features)


TF-IDF Vectorizer Output:
 [[0.         0.         0.         0.         0.57735027 0.57735027
  0.57735027 0.         0.         0.         0.         0.        ]
 [0.30650422 0.         0.30650422 0.40301621 0.         0.
  0.         0.40301621 0.40301621 0.40301621 0.40301621 0.        ]
 [0.42804604 0.5628291  0.42804604 0.         0.         0.
  0.         0.         0.         0.         0.         0.5628291 ]]
TF-IDF Vectorizer Features: ['ai' 'future' 'is' 'language' 'learning' 'love' 'machine' 'natural' 'of'
 'part' 'processing' 'the']
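
The TF-IDF numbers above can be reproduced by hand, which helps show where values such as 0.57735027 come from. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer`: lowercasing, tokens of at least two word characters (so "I" and "a" are dropped), smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names are made up for this example.

```python
import math
import re
from collections import Counter

sentences = [
    "I love machine learning.",
    "Natural language processing is a part of AI.",
    "AI is the future.",
]

def tokenize(text):
    # Assumed to mirror the default token pattern: words of 2+ characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Term counts per sentence and the sorted vocabulary
docs = [Counter(tokenize(s)) for s in sentences]
vocab = sorted(set().union(*docs))
n_docs = len(docs)

# Document frequency and smoothed inverse document frequency
df = {t: sum(1 for d in docs if t in d) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    # Raw tf * idf, then scale the row to unit Euclidean length
    raw = [doc[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(d) for d in docs]
for row in rows:
    print([round(v, 8) for v in row])
```

In the first sentence all three remaining terms appear in only one document, so they share the same IDF and each ends up at 1/sqrt(3) after normalization. In the other two sentences, "ai" and "is" occur in two documents and therefore receive lower weights than the rarer terms.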


## 5. Word Embeddings (Advanced)
Use spaCy to get word vectors (embeddings) for given words.

In [13]:
# Note: en_core_web_sm does not include static word vectors, so this cell uses en_core_web_md.
# Uncomment the line below to download the medium model if it is not installed.
#!python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

# Example word vector
word = nlp("machine")[0]
print("Vector for 'machine':\n", word.vector)

Vector for 'machine':
 [-0.72883    0.20718   -0.0033379 -0.0027673 -0.17204    0.023277
  0.1297    -0.2112     0.32876    0.67447    0.10047   -0.30559
  0.11213    0.22959   -0.32997    0.1389    -0.57289    2.523
 -0.32921    0.06045    0.23895    0.1091     0.19358   -0.1765
  0.11583    0.63204   -0.13644   -0.24354    0.20061   -0.50244
  0.40537   -0.38688    0.73784    0.093937  -0.30643    0.045874
  0.097915  -0.082114   0.13082   -0.039022   0.088084  -0.27023
 -0.077658  -0.0045355  0.18986   -0.063083  -0.138      0.40474
 -0.16199   -0.10953    0.22923   -0.67634   -0.65763   -0.044595
 -0.12119    0.071167   0.25993   -0.27052   -0.22474   -0.13818
  0.20692    0.87604   -0.35257   -0.1498     0.72804    0.68768
  0.19993    0.084733  -0.2234     0.11301    0.29895   -0.090119
  0.038172  -0.32912    0.014221  -0.36335    0.5898     0.10467
  0.16549    0.47199    0.078939  -0.19985    0.84014   -0.2277
 -0.22907   -0.26243   -0.32598    1.0146    -0.079235  -0.34248
  

In [14]:
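# Print the first five dimensions of the vector for each sentence's first token.
# nlp(sent)[0] selects only the first token, so the output is labelled with that token.
for sent in sentences:
    token = nlp(sent)[0]
    print(f"Vector for '{token.text}' (first token of '{sent}'):\n{token.vector[:5]}...")

Vector for 'I' (first token of 'I love machine learning.'):
[-0.83712   -0.40632   -0.24202   -0.37719    0.0055611]...
Vector for 'Natural' (first token of 'Natural language processing is a part of AI.'):
[-0.66059   0.2348   -0.021227 -0.32737  -0.062493]...
Vector for 'AI' (first token of 'AI is the future.'):
[-0.83712   -0.40632   -0.24202   -0.37719    0.0055611]...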
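
Because `nlp(...)[0]` selects only the first token, the vectors printed above each describe a single word rather than a whole sentence. A common way to get one vector per sentence is to average the token vectors, which is also what spaCy's `doc.vector` does by default. The sketch below shows that averaging step with made-up 3-dimensional vectors; the `toy_vectors` values and the `average_vector` helper are hypothetical, and real `en_core_web_md` vectors have 300 dimensions.

```python
# Hypothetical 3-dimensional token vectors, invented for illustration only.
toy_vectors = {
    "ai":     [1.0, 0.0, 2.0],
    "is":     [0.0, 1.0, 0.0],
    "the":    [0.0, 1.0, 1.0],
    "future": [3.0, 2.0, 1.0],
}

def average_vector(tokens, vectors):
    """Average the vectors of all tokens that have an entry in `vectors`."""
    rows = [vectors[t] for t in tokens if t in vectors]
    if not rows:
        return []
    # zip(*rows) groups the values dimension by dimension
    return [sum(col) / len(rows) for col in zip(*rows)]

sentence_vector = average_vector(["ai", "is", "the", "future"], toy_vectors)
print("Sentence vector:", sentence_vector)
```

With the real model, the equivalent would presumably be `nlp(sent).vector` for the sentence average, or iterating over `nlp(sent)` to print one vector per token.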
