
**Lab 2: Mari McMurtrie**
1.   Download the IMDb dataset from Hugging Face (the train split)
2.   Create several functions:
- Sentence tokenization function
  - Apply sentence tokenization to the text
  - Return a dataframe that expands each text into one row per sentence (e.g., if the first text has five sentences, it now occupies five rows)
- Text cleaning function
  - Remove non-alphanumeric characters
  - Remove stop words
  - Lemmatize text
  - Return the cleaned text
  - Apply to each row and create a new column with the cleaned text
- Vectorization function
  - Return a bigram document-term matrix
  - Return Tf-idf score vectors/matrix



**Download the IMDb dataset from Hugging Face (the train split)**


In [None]:
!pip install datasets
from datasets import load_dataset
import pandas as pd

dataset = load_dataset('imdb', split='train')
imdb_df = pd.DataFrame(dataset)

In [None]:
imdb_df.head()

In [None]:
imdb_df.info()  # 25,000 rows in the train split

In [None]:
imdb_df['label'].unique()  # 'label' contains only 0 (negative) and 1 (positive)

In [None]:
print(type(dataset))
print(dataset.column_names)
print(type(dataset['text']))
print(dataset[0]['text'])

**Sentence tokenization function**


*   Apply sentence tokenization to the text

*   Return a dataframe that expands each text into multiple rows for each sentence (e.g., if the first text has five sentences, there are now five rows for the original text)
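
The expansion step can be illustrated on a tiny example before running it on the full dataset. The sketch below is not the notebook's implementation: `naive_sent_split` is a hypothetical regex stand-in for NLTK's `sent_tokenize` so that the example runs without any downloads, and `expand_to_sentences` and the two sample reviews are made up for illustration. Each output record keeps the label of the review it came from.

```python
import re

def naive_sent_split(text: str) -> list[str]:
    """Hypothetical stand-in for nltk's sent_tokenize:
    split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def expand_to_sentences(rows, split_fn=naive_sent_split):
    """Return one {'text', 'label'} record per sentence, copying the parent label."""
    expanded = []
    for row in rows:
        for sentence in split_fn(row['text']):
            expanded.append({'text': sentence, 'label': row['label']})
    return expanded

# Made-up sample reviews (not taken from the IMDb dataset)
reviews = [
    {'text': "Great cast. Weak plot! Would I watch it again? Maybe.", 'label': 1},
    {'text': "Dull from start to finish.", 'label': 0},
]

sentence_rows = expand_to_sentences(reviews)
for record in sentence_rows:
    print(record)
```

The resulting list of dicts can be passed straight to `pd.DataFrame`, and `split_fn=sent_tokenize` can be supplied once the NLTK tokenizer data is available.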




In [None]:
import nltk
from nltk.tokenize import sent_tokenize
import time
nltk.download('punkt_tab')

start = time.time()  # sentence-splitting every review takes a while
data = []  # one dict per sentence: {'text': sentence, 'label': label}
for row in dataset:
  # Each sentence inherits the label of the review it came from
  for sentence in sent_tokenize(row['text']):
    data.append({'text': sentence, 'label': row['label']})

sentence_imdb_df = pd.DataFrame(data)

elapsed_minutes = (time.time() - start) / 60
print(f"It took {elapsed_minutes:.1f} minutes to process")
sentence_imdb_df.head()


In [None]:
sentence_imdb_df.info()

**Text cleaning function**

* Remove non-alphanumeric characters
* Remove stop words
* Lemmatize text
* Return the cleaned text
* Apply to each row and create a new column with the cleaned text
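
To make the effect of each cleaning step visible, here is a minimal sketch that runs the same sequence on a single sentence. It is an illustration under stated assumptions: `TOY_STOP_WORDS` and `TOY_LEMMAS` are tiny hypothetical tables standing in for NLTK's English stop-word list and `WordNetLemmatizer`, and `clean_sentence` is a made-up helper name.

```python
import re

# Toy stand-ins (assumptions): the notebook itself uses NLTK's English stop-word
# list and WordNetLemmatizer; these tables only keep the example self-contained.
TOY_STOP_WORDS = {"the", "were", "in", "a", "and"}
TOY_LEMMAS = {"actors": "actor", "movies": "movie", "scenes": "scene"}

def clean_sentence(sentence: str) -> str:
    # 1. Remove everything that is not a letter, digit, or whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', sentence)
    # 2. Lower-case and split on whitespace
    words = text.lower().split()
    # 3. Drop stop words
    words = [w for w in words if w not in TOY_STOP_WORDS]
    # 4. Map each word to its lemma (unknown words pass through unchanged)
    words = [TOY_LEMMAS.get(w, w) for w in words]
    return " ".join(words)

cleaned = clean_sentence("The actors were great in the final scenes!")
print(cleaned)

# Side effect of step 1: punctuation is deleted, not replaced by a space,
# so "It's" becomes "its" and "10/10" becomes "1010".
print(clean_sentence("It's 10/10!"))
```

Because step 1 deletes characters instead of replacing them with spaces, contractions and scores are fused into single tokens, which is worth keeping in mind when reading the vocabulary later.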


In [None]:
sentence_imdb_df.head()

In [None]:
!pip install nltk

In [None]:

from sklearn.feature_extraction.text import CountVectorizer

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re

nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')

STOP_WORDS = set(stopwords.words('english'))  # build once, not once per sentence

def normalize_text(corpus: list[str], lemmatizer: WordNetLemmatizer) -> list[str]:
  """Strip non-alphanumerics, lower-case, drop stop words, and lemmatize each sentence."""
  normalized_corpus: list[str] = []
  for sentence in corpus:
    # Remove non-alphanumeric characters (whitespace is kept), then lower-case
    alpha_numeric_sentence = re.sub(r'[^a-zA-Z0-9\s]', '', sentence).lower()
    # Remove stop words
    words = word_tokenize(alpha_numeric_sentence)
    filtered_words = [word for word in words if word not in STOP_WORDS]
    # Lemmatize text (WordNetLemmatizer treats every word as a noun by default)
    lemmatized_sentence = " ".join(lemmatizer.lemmatize(word) for word in filtered_words)
    normalized_corpus.append(lemmatized_sentence)
  return normalized_corpus

lemmatizer = WordNetLemmatizer()
corpus: list[str] = sentence_imdb_df['text'].tolist()
sentence_imdb_df['normalized'] = normalize_text(corpus, lemmatizer)
sentence_imdb_df.tail()




**Vectorization function:**
* Returns:
  * bigram document term matrix
  * Tf-idf score vectors/matrix
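
As a small worked example of what a bigram document-term matrix contains, the sketch below builds one by hand with the standard library only. It is not how scikit-learn implements `CountVectorizer`; the two documents and the helper `bigrams` are made up for illustration. Rows are documents, columns are distinct bigrams, and each cell is how often that bigram occurs in that document.

```python
from collections import Counter

def bigrams(tokens):
    """Adjacent token pairs joined by a space, e.g. ['a', 'b', 'c'] -> ['a b', 'b c']."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

# Made-up, already-normalized documents
docs = ["great movie great cast", "great movie bad plot"]

# Count bigrams per document
counts = [Counter(bigrams(doc.split())) for doc in docs]

# Vocabulary: every distinct bigram, sorted, mapped to a column index
vocab = {term: idx for idx, term in enumerate(sorted(set().union(*counts)))}

# Dense document-term matrix: rows are documents, columns are bigrams
dtm = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in dtm:
    print(row)
```

The `vocab` dict plays the same role as `vocabulary_` on a fitted `CountVectorizer`: it maps each bigram to its column index. On a real corpus the matrix is stored in sparse form because almost all cells are zero.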


In [None]:
# bigram document term matrix
countVectorizer = CountVectorizer(lowercase=True, stop_words='english', ngram_range=(2, 2))
normalized_coprpus = sentence_imdb_df['normalized'].tolist()
print(normalized_coprpus[0:10])
X = countVectorizer.fit_transform(normalized_coprpus)
print("Bigram document-term matrix (sparse)")
print(X.shape)
print(X)

In [None]:
# Examine the first 10 entries of the bigram vocabulary (bigram -> column index)
from itertools import islice

for key, value in islice(countVectorizer.vocabulary_.items(), 10):
    print(f"{key}: {value}")


In [None]:
# Tf-idf score vectors/matrix
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfVectorizer = TfidfVectorizer()
Xt = tfidfVectorizer.fit_transform(normalized_coprpus)
print(Xt.shape)
print(Xt)
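
To show where the Tf-idf scores come from, here is a hand-rolled sketch that follows the formula scikit-learn documents for `TfidfVectorizer` with default settings: raw term counts, smoothed idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each row. The two documents and the helper `tfidf_row` are made up for illustration, and whitespace splitting is a simplification of the vectorizer's own tokenizer.

```python
import math
from collections import Counter

# Made-up, already-normalized documents
docs = ["great movie", "bad movie"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)
vocab = sorted({t for doc in tokenized for t in doc})

# Document frequency: number of documents that contain each term
df = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Tf-idf weights for one document, scaled to unit L2 norm."""
    tf = Counter(tokens)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

tfidf = [tfidf_row(doc) for doc in tokenized]

print(vocab)
for row in tfidf:
    print([round(v, 3) for v in row])
```

In each row the term shared by both documents ("movie") receives a lower weight than the term unique to that document, which is the behaviour Tf-idf is meant to provide.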