In [1]:
import pandas as pd
import re
import string
import nltk

stopwords = nltk.corpus.stopwords.words("english")
ps = nltk.PorterStemmer()

data = pd.read_csv("ds/SMSSpamCollection", sep="\t", header=None)
data.columns = ["label", "msg"]
data.head()

Unnamed: 0,label,msg
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
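def clean_text(txt):
    # Drop punctuation characters.
    txt = "".join([c for c in txt if c not in string.punctuation])
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning).
    tokens = re.split(r"\W+", txt)
    # Remove stopwords, then stem; note that the stopword check is case-sensitive.
    txt = " ".join([ps.stem(word) for word in tokens if word not in stopwords])
    return txt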

data["msg_clean"] = data["msg"].apply(clean_text)
data.head()

Unnamed: 0,label,msg,msg_clean
0,ham,"Go until jurong point, crazy.. Available only ...",go jurong point crazi avail bugi n great world...
1,ham,Ok lar... Joking wif u oni...,ok lar joke wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entri 2 wkli comp win fa cup final tkt 21...
3,ham,U dun say so early hor... U c already then say...,u dun say earli hor u c alreadi say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah i dont think goe usf live around though
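
In row 4 of the output above, "i" survives even though it is a stopword. A likely cause is that the stopword check runs before anything is lowercased, so "I" never matches the all-lowercase list. The sketch below illustrates this with a tiny hand-written stopword set (`STOPWORDS` and `remove_stopwords` are hypothetical stand-ins, not the NLTK list or the notebook's `clean_text`) and skips stemming so that it runs with the standard library only.

```python
import re
import string

# Hypothetical, tiny stand-in for nltk's English stopword list (all lowercase).
STOPWORDS = {"i", "he", "to", "so", "then", "until", "only"}

def remove_stopwords(txt, lowercase):
    """Strip punctuation, optionally lowercase, then drop stopwords."""
    txt = "".join(c for c in txt if c not in string.punctuation)
    if lowercase:
        txt = txt.lower()
    tokens = [t for t in re.split(r"\W+", txt) if t]
    return [t for t in tokens if t not in STOPWORDS]

msg = "Nah I don't think he goes to usf"

# Case-sensitive check: "I" is not in the lowercase set, so it is kept.
print(remove_stopwords(msg, lowercase=False))
# ['Nah', 'I', 'dont', 'think', 'goes', 'usf']

# Lowercasing first lets "i" match and be removed.
print(remove_stopwords(msg, lowercase=True))
# ['nah', 'dont', 'think', 'goes', 'usf']
```

Whether to lowercase before filtering is a design choice; doing so in `clean_text` would change the `msg_clean` values shown above.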


In [55]:
"""
page 29 on ppt
1. Explain the result of the data.
2. Try changing to bigrams and trigrams, ngram_range=(2, 3), then observe and explain the results.

Explanation:
1.
CountVectorizer counts how many times each n-gram appears in a document (here, each string in the corpus).
An n-gram is a sequence of n consecutive tokens in a text. With CountVectorizer's default settings, the tokens are lowercased words of two or more characters, so punctuation and one-letter words such as "a" are dropped before the n-grams are formed.
An ngram_range of (2, 2) sets both the minimum and the maximum n-gram length to 2 words, so only chunks of exactly 2 words are extracted. An ngram_range of (2, 3) sets the minimum to 2 words and the maximum to 3, so BOTH 2-word and 3-word chunks are extracted.

fit() learns the vocabulary: every distinct n-gram in the corpus is assigned a column index.
transform() converts documents into a document-term matrix of n-gram counts using that vocabulary.
fit_transform() combines these two operations.

a. X.shape is the shape of the document-term matrix returned by the CountVectorizer object: (number of documents, number of distinct n-grams). With ngram_range=(2, 2) it is (3, 8), because the corpus has 3 sentences and 8 distinct bigrams are generated from them.
b. Printing X itself lists the nonzero entries of the sparse matrix as (row, column) coordinates followed by the count. For example, with ngram_range=(2, 3), X[0, 11] returns 1 because n-gram 11 ("this is") occurs once in sentence 0, while X[0, 10] returns 0 because n-gram 10 ("third document is") does not occur in sentence 0.
c. X.toarray() converts the sparse matrix into a dense array of counts. Each entry is the number of times an n-gram (column) occurs in a sentence (row). Here every entry is 0 or 1 because no n-gram occurs more than once in the same sentence.
d. The DataFrame at the end labels each column with its n-gram, showing which n-grams (columns) occur in which sentence (row), whereas X.toarray() shows only the bare array. The columns are in alphabetical order, which is the same column order as in X.toarray().

2.
Changing ngram_range from (2, 2) to (2, 3) makes CountVectorizer capture n-grams that are 2 or 3 words long, whereas the earlier setting yielded only 2-word n-grams. The number of features grows from 8 to 14: the 8 bigrams plus 6 trigrams.

some quickies:
1. If CountVectorizer is initialized with no parameters, ngram_range is (1, 1).
2. CountVectorizer counts how many times a word/n-gram appears in a document (here, each string in the corpus).
3. The array size is (number of sentences in the corpus) x (number of feature names, see get_feature_names_out()); here it is 3x14.
4. A nonzero value means the n-gram occurs in the document (the value is its count), while 0 means it does not. For example, "sentence is" occurs in corpus[0], but "document is" does not.
"""
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(2, 3)) # changed from (2, 2) to (2, 3)

corpus = ["This is a sentence is",
          "This is another sentence",
          "third document is here"]

X = cv.fit_transform(corpus)
print(X.shape)
print(X)
print(X.toarray())

df = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out()) # get_feature_names() was removed in scikit-learn 1.2
print(df)

X.shape
 (3, 14)
X
   (0, 11)	1
  (0, 6)	1
  (0, 8)	1
  (0, 13)	1
  (0, 7)	1
  (1, 11)	1
  (1, 3)	1
  (1, 0)	1
  (1, 12)	1
  (1, 4)	1
  (2, 9)	1
  (2, 1)	1
  (2, 5)	1
  (2, 10)	1
  (2, 2)	1
X.toarray()
 [[0 0 0 0 0 0 1 1 1 0 0 1 0 1]
 [1 0 0 1 1 0 0 0 0 0 0 1 1 0]
 [0 1 1 0 0 1 0 0 0 1 1 0 0 0]]
df
    another sentence  document is  document is here  is another  \
0                 0            0                 0           0   
1                 1            0                 0           1   
2                 0            1                 1           0   

   is another sentence  is here  is sentence  is sentence is  sentence is  \
0                    0        0            1               1            1   
1                    1        0            0               0            0   
2                    0        1            0               0            0   

   third document  third document is  this is  this is another  \
0               0                  0        1              