# Cleaning
It's your turn! Udacity's [course catalog page](https://www.udacity.com/courses/all) has changed since the last video was filmed. One notable change is the introduction of _schools_.

In this activity, you're going to perform similar actions with BeautifulSoup to extract the following information from each course listing on the page:
1. The course name - e.g. "Data Analyst"
2. The school the course belongs to - e.g. "School of Data Science"

### Step 1: Get text from Udacity's course catalog web page
You can use the `requests` library to do this.

Printing all of the page's JavaScript, CSS, and text could overwhelm the notebook, so we omit a print statement here.

In [1]:
import requests
import re
from bs4 import BeautifulSoup

In [2]:
#: fetch the web site
r = requests.get("https://www.udacity.com/courses/all")

### Step 2: Use BeautifulSoup to remove HTML tags
Use `"lxml"` rather than `"html5lib"`.


In [3]:
soup = BeautifulSoup(r.text, 'lxml')

In [4]:
print(r.text[:100000])

<!DOCTYPE html><html lang="en-US"><head><meta charSet="UTF-8"/><script type="text/javascript">window.NREUM||(NREUM={}),__nr_require=function(t,n,e){function r(e){if(!n[e]){var o=n[e]={exports:{}};t[e][0].call(o.exports,function(n){var o=t[e][1][n];return r(o||n)},o,o.exports)}return n[e].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<e.length;o++)r(e[o]);return r}({1:[function(t,n,e){function r(t){try{s.console&&console.log(t)}catch(n){}}var o,i=t("ee"),a=t(15),s={};try{o=localStorage.getItem("__nr_flags").split(","),console&&"function"==typeof console.log&&(s.console=!0,o.indexOf("dev")!==-1&&(s.dev=!0),o.indexOf("nr_dev")!==-1&&(s.nrDev=!0))}catch(c){}s.nrDev&&i.on("internal-error",function(t){r(t.stack)}),s.dev&&i.on("fn-err",function(t,n,e){r(e.stack)}),s.dev&&(r("NR AGENT IN DEVELOPMENT MODE"),r("flags: "+a(s,function(t,n){return t}).join(", ")))},{}],2:[function(t,n,e){function r(t,n,e,r,s){try{p?p-=1:o(s||new UncaughtException(t,n,e),!0)}catch(f){tr

### Step 3: Find all course summaries
Use BeautifulSoup's `find_all` method to select elements by tag type and class name. To find the right selector, right-click a course card on the web page and choose "Inspect" to view its HTML.

In [5]:
#: Find all course summaries
summaries = soup.find_all("div", {"class":"catalog-component__card"})
#summaries = soup.find_all("h2", {"class":"card__title__nd-name"})

print('Number of Courses', len(summaries))

Number of Courses 242


### Step 4: Inspect the first summary to find selectors for the course name and school
Tip: `.prettify()` is a super helpful method BeautifulSoup provides to output html in a nicely indented form! Make sure to use `print()` to ensure whitespace is displayed properly.

In [6]:
print(summaries[0].prettify())

<div class="catalog-component__card">
 <span class="catalog-card-tag--mobile">
  New Program!
 </span>
 <a aria-label="Introduction to Cybersecurity" class="card__top" href="/course/intro-to-cybersecurity-nanodegree--nd545">
  <div class="card__image-container">
   <div class="card__image-wrapper">
    <div class="card__image-overlay" data-catalogtype="nanodegree">
    </div>
    <div class="card__image" style="background-image:url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABAQMAAAAl21bKAAAAA1BMVEVMaXFNx9g6AAAAAXRSTlMAQObYZgAAAApJREFUeNpjYAAAAAIAAeUn3vwAAAAASUVORK5CYII)">
    </div>
   </div>
  </div>
  <div class="card__title-container">
   <span class="catalog-card-tag--desktop">
    New
   </span>
   <h3 class="card__title__school greyed">
    School of Programming &amp; Development
   </h3>
   <h2 class="card__title__nd-name">
    Introduction to Cybersecurity
   </h2>
  </div>
  <div class="card__text-content">
   <section>
    <h4 class="text-content__text greyed">
    

Look for the selectors that contain the course title and school name text you want to extract. Then, use the `select_one` method on the summary object to pull out the HTML matching those selectors. Afterwards, don't forget to do some extra cleaning to isolate the names (strip the remaining HTML and surrounding whitespace).

In [7]:
#: extract course title
summaries[0].select_one("h2").get_text().strip()

'Introduction to Cybersecurity'

In [8]:
#: extract school
summaries[0].select_one("h3").get_text().strip()

'School of Programming & Development'

### Step 5: Collect names and schools of ALL course listings
Reuse your code from the previous step, but now in a loop to extract the name and school from every course summary in `summaries`!

In [9]:
courses = []
for summary in summaries:
    title = summary.select_one("h2").get_text().strip()
    school = summary.select_one("h3").get_text().strip()
    courses.append((title, school))
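The loop above assumes every card contains both an `h2` and an `h3`. `select_one` returns `None` when nothing matches, so a card without a school would raise an `AttributeError`. Here is a minimal sketch of a more defensive version. The HTML snippet and the `text_or_none` helper are made up for illustration, and `"html.parser"` is used only so the sketch runs without `lxml` installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the card structure shown above.
# The second card deliberately has no school heading.
html = """
<div class="catalog-component__card">
  <h3 class="card__title__school greyed"> School of Data Science </h3>
  <h2 class="card__title__nd-name"> Data Analyst </h2>
</div>
<div class="catalog-component__card">
  <h2 class="card__title__nd-name"> Course Without School </h2>
</div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text().strip() if tag is not None else None


demo_soup = BeautifulSoup(html, "html.parser")
demo_courses = []
for card in demo_soup.find_all("div", {"class": "catalog-component__card"}):
    # Class-qualified selectors are a little more specific than bare "h2"/"h3"
    title = text_or_none(card.select_one("h2.card__title__nd-name"))
    school = text_or_none(card.select_one("h3.card__title__school"))
    demo_courses.append((title, school))

print(demo_courses)
```

Cards with a missing field show up as `None` instead of stopping the loop, which makes them easy to filter out afterwards.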

In [10]:
print(len(courses), "course summaries found. Sample:")
courses[:20]

242 course summaries found. Sample:


[('Introduction to Cybersecurity', 'School of Programming & Development'),
 ('Establishing Data Infrastructure', 'School of Business'),
 ('Intermediate JavaScript', 'School of Programming & Development'),
 ('Monetization Strategy', 'School of Business'),
 ('Applying Data Science to Product Management', 'School of Business'),
 ('Data Product Manager', 'School of Business'),
 ('SQL', 'School of Data Science'),
 ('Programming for Data Science with Python', 'School of Data Science'),
 ('Self Driving Car Engineer', 'School of Autonomous Systems'),
 ('Machine Learning Engineer', 'School of Artificial Intelligence'),
 ('Introduction to Programming', 'School of Programming & Development'),
 ('Deep Learning', 'School of Artificial Intelligence'),
 ('Data Scientist', 'School of Data Science'),
 ('Data Engineer', 'School of Data Science'),
 ('Data Analyst', 'School of Data Science'),
 ('UX Designer', 'School of Business'),
 ('Digital Marketing', 'School of Business'),
 ('AI for Healthcare', 'Scho

# Normalization
Normalize case in the following text and remove punctuation!

In [11]:
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"
print(text)

The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?


### Case Normalization

In [12]:
text = text.lower()
print(text)

the first time you see the second renaissance it may look boring. look at it at least twice and definitely watch part 2. it will change your view of the matrix. are the human people the ones who started the war ? is ai a bad thing ?


### Punctuation Removal
Use the `re` library to remove punctuation with a regular expression (regex). Feel free to refer back to the video or Google to get your regular expression. You can learn more about regex [here](https://docs.python.org/3/howto/regex.html).

In [13]:
import re

#: remove punctuation characters
text = re.sub(r"[^a-zA-Z0-9]", " ", text)
print(text)

the first time you see the second renaissance it may look boring  look at it at least twice and definitely watch part 2  it will change your view of the matrix  are the human people the ones who started the war   is ai a bad thing  
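Notice that replacing each punctuation character with a space leaves runs of double spaces and trailing whitespace in the output. If you want a tidy string, one option is to split on whitespace and join the pieces back together. This is only a sketch; the variable names are illustrative.

```python
import re

sample = "Are the human people the ones who started the war ? Is AI a bad thing ?"

# Same regex as above: anything that is not a letter or digit becomes a space
no_punct = re.sub(r"[^a-zA-Z0-9]", " ", sample.lower())

# str.split() with no arguments splits on any run of whitespace,
# so joining with a single space collapses the extra gaps
collapsed = " ".join(no_punct.split())

print(repr(no_punct))
print(repr(collapsed))
```

Tokenizers such as `word_tokenize` ignore the extra whitespace anyway, so this step matters mostly when you display or store the cleaned text.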


### Note on NLTK data download
Run the cell below to download the necessary nltk data packages. You can download all packages by entering `nltk.download()` on your computer. Keep in mind this does take up a bit more space. You can learn more about nltk data installation [here](https://www.nltk.org/data.html).

In [14]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

# Tokenization
Try out the tokenization methods in nltk to split the following text into words and then sentences.


In [15]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

In [16]:
text = "Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers."
print(text)

Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers.


In [17]:
#: Split text into words using NLTK
words = word_tokenize(text)
print(words)

['Dr.', 'Smith', 'graduated', 'from', 'the', 'University', 'of', 'Washington', '.', 'He', 'later', 'started', 'an', 'analytics', 'firm', 'called', 'Lux', ',', 'which', 'catered', 'to', 'enterprise', 'customers', '.']


In [18]:
#: Split text into sentences
sentences = sent_tokenize(text)
print(sentences)

['Dr. Smith graduated from the University of Washington.', 'He later started an analytics firm called Lux, which catered to enterprise customers.']
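To see why a sentence tokenizer is worth using, compare the result above with a naive split on a period followed by a space. This small sketch uses only built-in string methods.

```python
text = ("Dr. Smith graduated from the University of Washington. "
        "He later started an analytics firm called Lux, which catered to enterprise customers.")

# Naive approach: treat every ". " as a sentence boundary
naive_sentences = text.split(". ")

for piece in naive_sentences:
    print(repr(piece))
```

The naive split breaks after the abbreviation "Dr" and produces three pieces instead of two, whereas `sent_tokenize` keeps "Dr. Smith" together.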


# Stop Words
Combine the steps you learned so far to normalize, tokenize, and remove stop words from the text below.

In [19]:
# import statements
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [20]:
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"
print(text)

The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?


In [21]:
#: normalize text
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

In [22]:
#: Tokenize the text
words = word_tokenize(text)
print(words)

['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']


In [23]:
#: Remove stop words
words = [w for w in words if w not in stopwords.words("english")]
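A small performance note: `stopwords.words("english")` returns a list, and the comprehension above calls it again for every token. Building a set once is a common alternative, since set membership checks are fast. The sketch below uses a small hand-written subset of stop words (an assumption, so that it runs without the NLTK data); in the notebook you could use `set(stopwords.words("english"))` instead.

```python
# Hypothetical subset of NLTK's English stop words, for illustration only
stop_set = {"the", "it", "at", "and", "a", "is", "of", "you", "your", "are", "who"}

tokens = ["look", "at", "it", "at", "least", "twice", "and",
          "definitely", "watch", "part", "2"]

# Membership tests against a set do not scan the whole collection
filtered = [t for t in tokens if t not in stop_set]
print(filtered)
```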

In [24]:
#: print the stop words included in NLTK's corpus
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

## Parts of Speech (POS) Tagging

In [25]:
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize

In [26]:
text = "I always lie down to tell a lie."

In [27]:
#: tokenize text
sentence = word_tokenize(text)
print(sentence)

['I', 'always', 'lie', 'down', 'to', 'tell', 'a', 'lie', '.']


In [28]:
#: tag each word with part of speech (POS)
pos_tag(sentence)

[('I', 'PRP'),
 ('always', 'RB'),
 ('lie', 'VBP'),
 ('down', 'RP'),
 ('to', 'TO'),
 ('tell', 'VB'),
 ('a', 'DT'),
 ('lie', 'NN'),
 ('.', '.')]
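The tags come from the Penn Treebank tag set. As a reading aid, the sketch below pairs each tag in the output above with a short description and pulls out the two tags assigned to "lie". The `tag_descriptions` dictionary is written by hand for this example and covers only the tags that appear here.

```python
# Output of pos_tag(sentence) copied from the cell above
tagged = [('I', 'PRP'), ('always', 'RB'), ('lie', 'VBP'), ('down', 'RP'),
          ('to', 'TO'), ('tell', 'VB'), ('a', 'DT'), ('lie', 'NN'), ('.', '.')]

# Hand-written descriptions for the Penn Treebank tags seen above
tag_descriptions = {
    'PRP': 'personal pronoun',
    'RB': 'adverb',
    'VBP': 'verb, non-3rd person singular present',
    'RP': 'particle',
    'TO': 'the word "to"',
    'VB': 'verb, base form',
    'DT': 'determiner',
    'NN': 'noun, singular or mass',
    '.': 'sentence-final punctuation',
}

for word, tag in tagged:
    print(f"{word:>7}  {tag:<4} {tag_descriptions[tag]}")

# The same surface form receives different tags depending on context
lie_tags = [tag for word, tag in tagged if word == 'lie']
print(lie_tags)
```

The first "lie" is tagged as a verb and the second as a noun, which is exactly the ambiguity this sentence was chosen to show.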

## Named Entity Recognition (NER)

In [29]:
text = "Antonio joined Udacity Inc. in California."

In [30]:
#: tokenize, pos tag, then recognize named entities in text
tree = ne_chunk(pos_tag(word_tokenize(text)))
print(tree)

(S
  (PERSON Antonio/NNP)
  joined/VBD
  (ORGANIZATION Udacity/NNP Inc./NNP)
  in/IN
  (GPE California/NNP)
  ./.)


### Sentence Parsing

In [31]:
# Define a custom grammar
my_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")
parser = nltk.ChartParser(my_grammar)

In [32]:
# Parse a sentence
sentence = word_tokenize("I shot an elephant in my pajamas")
for tree in parser.parse(sentence):
    print(tree)

(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))


# Stemming and Lemmatizing
Let's return to this example from the stop words removal quiz.

In [33]:
from nltk.corpus import stopwords

In [34]:
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"

# Normalize text
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

# Tokenize text
words = text.split()
print(words)

['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']


In [35]:
#: remove stop words
words = [w for w in words if w not in stopwords.words('english')]
print(words)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'ones', 'started', 'war', 'ai', 'bad', 'thing']


### Stemming

In [36]:
from nltk.stem.porter import PorterStemmer

#: reduce words to their stems
stemmed = [PorterStemmer().stem(w) for w in words]
print(stemmed)

['first', 'time', 'see', 'second', 'renaiss', 'may', 'look', 'bore', 'look', 'least', 'twice', 'definit', 'watch', 'part', '2', 'chang', 'view', 'matrix', 'human', 'peopl', 'one', 'start', 'war', 'ai', 'bad', 'thing']


### Lemmatization

In [37]:
from nltk.stem.wordnet import WordNetLemmatizer

#: reduce words to their root form
lemmed = [WordNetLemmatizer().lemmatize(w) for w in words]
print(lemmed)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'one', 'started', 'war', 'ai', 'bad', 'thing']


In [38]:
#: Lemmatize verbs by specifying pos
lemmed = [WordNetLemmatizer().lemmatize(w, pos='v') for w in words]
print(lemmed)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'bore', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'ones', 'start', 'war', 'ai', 'bad', 'thing']


# Bag of Words and TF-IDF
Below, we'll look at three useful methods of vectorizing text.
- `CountVectorizer` - Bag of Words
- `TfidfTransformer` - TF-IDF values
- `TfidfVectorizer` - Bag of Words AND TF-IDF values



In [85]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize


In [86]:
corpus = ["The first time you see The Second Renaissance it may look boring.",
        "Look at it at least twice and definitely watch part 2.",
        "It will change your view of the matrix.",
        "Are the human people the ones who started the war?",
        "Is AI a bad thing ?"]

In [87]:
stop_words = stopwords.words("english")
lemmatizer = WordNetLemmatizer()

Use the skills you learned so far to create a function `tokenize` that takes in a string of text and applies the following:
- case normalization (convert to all lowercase)
- punctuation removal
- tokenization, lemmatization, and stop word removal using `nltk`

Feel free to refer back to previous sections to complete these steps!

In [88]:
def tokenize(text):
    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # tokenize text
    tokens = word_tokenize(text)
    
    # lemmatize and remove stop words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return tokens

# `CountVectorizer` (Bag of Words)

In [126]:
from sklearn.feature_extraction.text import CountVectorizer

# initialize count vectorizer object
vect = CountVectorizer(tokenizer=tokenize)

In [127]:
# get counts of each token (word) in text data
X = vect.fit_transform(corpus)

In [128]:
# convert sparse matrix to numpy array to view
X.toarray()

array([[0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
        0, 0, 0],
       [1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
        0, 1, 0],
       [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0]])

In [129]:
# shape of sklearn vectorizer output after applying transform method.

X.shape

(5, 25)

In [130]:
# view token vocabulary: each token maps to its column index in X (not to a count)
vect.vocabulary_

{'first': 6,
 'time': 20,
 'see': 17,
 'second': 16,
 'renaissance': 15,
 'may': 11,
 'look': 9,
 'boring': 3,
 'least': 8,
 'twice': 21,
 'definitely': 5,
 'watch': 24,
 'part': 13,
 '2': 0,
 'change': 4,
 'view': 22,
 'matrix': 10,
 'human': 7,
 'people': 14,
 'one': 12,
 'started': 18,
 'war': 23,
 'ai': 1,
 'bad': 2,
 'thing': 19}
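To make the link between `vocabulary_` and the count matrix concrete, here is a rough pure-Python sketch of the same idea on a tiny, made-up pair of pre-tokenized documents. It is not how scikit-learn is implemented internally; it only mirrors the result: tokens are sorted alphabetically, each gets a column index, and each row holds the counts for one document.

```python
from collections import Counter

# Hypothetical pre-tokenized documents, for illustration only
docs = [["look", "boring"],
        ["look", "twice", "look"]]

# Map each unique token to a column index, in alphabetical order
unique_tokens = sorted({token for doc in docs for token in doc})
vocabulary = {token: index for index, token in enumerate(unique_tokens)}

# One row per document, one column per vocabulary entry
rows = []
for doc in docs:
    counts = Counter(doc)
    rows.append([counts.get(token, 0) for token in vocabulary])

print(vocabulary)
print(rows)
```

Reading the notebook output the same way, `'look': 9` means that column 9 of `X` holds the count of "look" in each sentence.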

# `TfidfTransformer`

In [131]:
from sklearn.feature_extraction.text import TfidfTransformer

# initialize tf-idf transformer object
transformer = TfidfTransformer(smooth_idf=False)

In [132]:
X.shape

(5, 25)

In [133]:
# use counts from count vectorizer results to compute tf-idf values
tfidf = transformer.fit_transform(X)

In [134]:
# convert sparse matrix to numpy array to view
tfidf.toarray()

array([[0.        , 0.        , 0.        , 0.36419547, 0.        ,
        0.        , 0.36419547, 0.        , 0.        , 0.26745392,
        0.        , 0.36419547, 0.        , 0.        , 0.        ,
        0.36419547, 0.36419547, 0.36419547, 0.        , 0.        ,
        0.36419547, 0.        , 0.        , 0.        , 0.        ],
       [0.39105193, 0.        , 0.        , 0.        , 0.        ,
        0.39105193, 0.        , 0.        , 0.39105193, 0.28717648,
        0.        , 0.        , 0.        , 0.39105193, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.39105193, 0.        , 0.        , 0.39105193],
       [0.        , 0.        , 0.        , 0.        , 0.57735027,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.57735027, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.57735027, 0.

In [135]:
tfidf.shape

(5, 25)

# `TfidfVectorizer`
`TfidfVectorizer` = `CountVectorizer` + `TfidfTransformer`

In [136]:
from sklearn.feature_extraction.text import TfidfVectorizer

# initialize tf-idf vectorizer object
vectorizer = TfidfVectorizer(tokenizer=tokenize)

In [137]:
# compute bag of word counts and tf-idf values
X = vectorizer.fit_transform(corpus)

In [138]:
# convert sparse matrix to numpy array to view
X.toarray()

array([[0.        , 0.        , 0.        , 0.36152912, 0.        ,
        0.        , 0.36152912, 0.        , 0.        , 0.29167942,
        0.        , 0.36152912, 0.        , 0.        , 0.        ,
        0.36152912, 0.36152912, 0.36152912, 0.        , 0.        ,
        0.36152912, 0.        , 0.        , 0.        , 0.        ],
       [0.38775666, 0.        , 0.        , 0.        , 0.        ,
        0.38775666, 0.        , 0.        , 0.38775666, 0.31283963,
        0.        , 0.        , 0.        , 0.38775666, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.38775666, 0.        , 0.        , 0.38775666],
       [0.        , 0.        , 0.        , 0.        , 0.57735027,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.57735027, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.57735027, 0.

In [139]:
# sklearn feature names, sorted alphabetically by default
# (scikit-learn 1.0 and later provide get_feature_names_out(); get_feature_names() was removed in 1.2)
print(vectorizer.get_feature_names())

['2', 'ai', 'bad', 'boring', 'change', 'definitely', 'first', 'human', 'least', 'look', 'matrix', 'may', 'one', 'part', 'people', 'renaissance', 'second', 'see', 'started', 'thing', 'time', 'twice', 'view', 'war', 'watch']


In [140]:
# Print the idf values learned by the sklearn tf-idf vectorizer during fit.
# After fitting on the corpus, the vocabulary has 25 tokens, each with its own idf value.

print(vectorizer.idf_)

[2.09861229 2.09861229 2.09861229 2.09861229 2.09861229 2.09861229
 2.09861229 2.09861229 2.09861229 1.69314718 2.09861229 2.09861229
 2.09861229 2.09861229 2.09861229 2.09861229 2.09861229 2.09861229
 2.09861229 2.09861229 2.09861229 2.09861229 2.09861229 2.09861229
 2.09861229]


In [141]:
# shape of sklearn tfidf vectorizer output after applying transform method.
#: five sentences and 25 vocabulary tokens, so the shape should be (5, 25)
X.shape

(5, 25)

In [142]:
# sklearn tfidf values for first line of the above corpus.
# Here the output is a sparse matrix

print(X[0])
print(X[0].shape)

  (0, 3)	0.36152911730069653
  (0, 9)	0.2916794154657719
  (0, 11)	0.36152911730069653
  (0, 15)	0.36152911730069653
  (0, 16)	0.36152911730069653
  (0, 17)	0.36152911730069653
  (0, 20)	0.36152911730069653
  (0, 6)	0.36152911730069653
(1, 25)


In [143]:
# sklearn tfidf values for first line of the above corpus.
# To understand the output better, here we are converting the sparse output matrix to dense matrix and printing it.
# Notice that this output is normalized using L2 normalization. sklearn does this by default.

print(X[0].toarray())

[[0.         0.         0.         0.36152912 0.         0.
  0.36152912 0.         0.         0.29167942 0.         0.36152912
  0.         0.         0.         0.36152912 0.36152912 0.36152912
  0.         0.         0.36152912 0.         0.         0.
  0.        ]]
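You may have noticed that the `TfidfTransformer(smooth_idf=False)` results earlier (0.36419547) differ slightly from the `TfidfVectorizer` results here (0.36152912). The difference comes from idf smoothing, which `TfidfVectorizer` applies by default. The sketch below recomputes the first row by hand for both settings, using the idf formulas from the scikit-learn documentation. The `tfidf_row` helper and the hard-coded document frequencies are assumptions made for this example.

```python
import math

n_docs = 5

# Document frequencies for the 8 tokens of the first sentence:
# 'look' appears in two sentences of the corpus, every other token in one
doc_freqs = {"boring": 1, "first": 1, "look": 2, "may": 1,
             "renaissance": 1, "second": 1, "see": 1, "time": 1}


def tfidf_row(doc_freqs, n_docs, smooth):
    """L2-normalized tf-idf weights for a document where every token occurs once."""
    weights = {}
    for token, df in doc_freqs.items():
        if smooth:
            idf = math.log((1 + n_docs) / (1 + df)) + 1  # smooth_idf=True
        else:
            idf = math.log(n_docs / df) + 1              # smooth_idf=False
        weights[token] = 1 * idf  # term frequency is 1 for every token here
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {token: w / norm for token, w in weights.items()}


smoothed = tfidf_row(doc_freqs, n_docs, smooth=True)
unsmoothed = tfidf_row(doc_freqs, n_docs, smooth=False)

print(round(smoothed["boring"], 8), round(smoothed["look"], 8))
print(round(unsmoothed["boring"], 8), round(unsmoothed["look"], 8))
```

"look" gets a lower weight than the other tokens in both cases because it appears in two documents, so its idf is smaller.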


# Compute TFIDF custom implementation

In [144]:
stop_words = stopwords.words("english")
lemmatizer = WordNetLemmatizer()

In [146]:
#: Create a sorted list of unique lemmatized words from the corpus

features = []
for doc in corpus:
    doc = re.sub(r"[^a-zA-Z0-9]", " ", doc.lower())
    for word in doc.split():
        # skip stop words, then lemmatize before checking for duplicates
        if word in stop_words:
            continue
        lemma = lemmatizer.lemmatize(word)
        if lemma not in features:
            features.append(lemma)
features.sort()

    

print(features)
print(len(features))

['2', 'ai', 'bad', 'boring', 'change', 'definitely', 'first', 'human', 'least', 'look', 'matrix', 'may', 'one', 'part', 'people', 'renaissance', 'second', 'see', 'started', 'thing', 'time', 'twice', 'view', 'war', 'watch']
25


In [169]:
#: Count the number of documents that contain each feature
counters = {}

for feature in features:
    for doc in corpus:
        doc = re.sub(r"[^a-zA-Z0-9]", " ", doc.lower())
        tokens = word_tokenize(doc)
        # lemmatize and remove stop words
        tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]

        if feature in tokens:
            counters[feature] = counters.get(feature, 0) + 1

counters

{'2': 1,
 'ai': 1,
 'bad': 1,
 'boring': 1,
 'change': 1,
 'definitely': 1,
 'first': 1,
 'human': 1,
 'least': 1,
 'look': 2,
 'matrix': 1,
 'may': 1,
 'one': 1,
 'part': 1,
 'people': 1,
 'renaissance': 1,
 'second': 1,
 'see': 1,
 'started': 1,
 'thing': 1,
 'time': 1,
 'twice': 1,
 'view': 1,
 'war': 1,
 'watch': 1}

In [170]:
#: features : sort by the key
print(sorted(counters.keys()))


['2', 'ai', 'bad', 'boring', 'change', 'definitely', 'first', 'human', 'least', 'look', 'matrix', 'may', 'one', 'part', 'people', 'renaissance', 'second', 'see', 'started', 'thing', 'time', 'twice', 'view', 'war', 'watch']


In [171]:
from collections import Counter
from tqdm import tqdm
from scipy.sparse import csr_matrix
import math
import operator
from sklearn.preprocessing import normalize
import numpy


In [172]:
IDF= {}
totalNumDoc = len(corpus)
print(totalNumDoc)
print(features)

5
['2', 'ai', 'bad', 'boring', 'change', 'definitely', 'first', 'human', 'least', 'look', 'matrix', 'may', 'one', 'part', 'people', 'renaissance', 'second', 'see', 'started', 'thing', 'time', 'twice', 'view', 'war', 'watch']


In [173]:

#: compute IDF
IDF= {}
totalNumDoc = len(corpus)
for feature in features:
    val = 1 + math.log( (1+totalNumDoc) / (1+counters[feature]) )
    IDF[feature] = float(format(val, '.8f'))

sorted(IDF.items(), key=lambda x: x[0])
result = [IDF[key] for key in sorted(IDF.keys())]
print(result)
print(len(result))

[2.09861229, 2.09861229, 2.09861229, 2.09861229, 2.09861229, 2.09861229, 2.09861229, 2.09861229, 2.09861229, 1.69314718, 2.09861229, 2.09861229, 2.09861229, 2.09861229, 2.09861229, 2.09861229, 2.09861229, 2.09861229, 2.09861229, 2.09861229, 2.09861229, 2.09861229, 2.09861229, 2.09861229, 2.09861229]
25


In [174]:
print(vectorizer.idf_)
print(len(vectorizer.idf_))

[2.09861229 2.09861229 2.09861229 2.09861229 2.09861229 2.09861229
 2.09861229 2.09861229 2.09861229 1.69314718 2.09861229 2.09861229
 2.09861229 2.09861229 2.09861229 2.09861229 2.09861229 2.09861229
 2.09861229 2.09861229 2.09861229 2.09861229 2.09861229 2.09861229
 2.09861229]
25


In [178]:
#: compute TF-IDF for each document
from scipy import sparse
from sklearn import preprocessing
import numpy as np

tfidf = []
for doc in corpus:
    # normalize, tokenize, lemmatize, and remove stop words once per document
    doc = re.sub(r"[^a-zA-Z0-9]", " ", doc.lower())
    tokens = [lemmatizer.lemmatize(w) for w in word_tokenize(doc) if w not in stop_words]

    tfidfDoc = []
    for feature in features:
        #: count how many times the feature occurs in the document
        count = tokens.count(feature)
        #: compute TF as the share of the document's tokens; any per-document
        #: constant cancels under the L2 normalization below
        TF = count / len(tokens)
        #: compute TFIDF for the document
        TFIDF = TF * IDF[feature]
        tfidfDoc.append(TFIDF)
    tfidf.append(tfidfDoc)

tfidf_csr = sparse.csr_matrix(tfidf)
tfidf_normalized = preprocessing.normalize(tfidf_csr, norm='l2')
print(tfidf_normalized)


  (0, 3)	0.36152911733038906
  (0, 6)	0.36152911733038906
  (0, 9)	0.2916794152081504
  (0, 11)	0.36152911733038906
  (0, 15)	0.36152911733038906
  (0, 16)	0.36152911733038906
  (0, 17)	0.36152911733038906
  (0, 20)	0.36152911733038906
  (1, 0)	0.3877566601424279
  (1, 5)	0.3877566601424279
  (1, 8)	0.3877566601424279
  (1, 9)	0.31283963158643757
  (1, 13)	0.3877566601424279
  (1, 21)	0.3877566601424279
  (1, 24)	0.3877566601424279
  (2, 4)	0.5773502691896258
  (2, 10)	0.5773502691896258
  (2, 22)	0.5773502691896258
  (3, 7)	0.4472135954999579
  (3, 12)	0.4472135954999579
  (3, 14)	0.4472135954999579
  (3, 18)	0.4472135954999579
  (3, 23)	0.4472135954999579
  (4, 1)	0.5773502691896258
  (4, 2)	0.5773502691896258
  (4, 19)	0.5773502691896258

In [179]:
#: tfidf_normalized == X
print(X) 


  (0, 3)	0.36152911730069653
  (0, 9)	0.2916794154657719
  (0, 11)	0.36152911730069653
  (0, 15)	0.36152911730069653
  (0, 16)	0.36152911730069653
  (0, 17)	0.36152911730069653
  (0, 20)	0.36152911730069653
  (0, 6)	0.36152911730069653
  (1, 0)	0.38775666010579296
  (1, 13)	0.38775666010579296
  (1, 24)	0.38775666010579296
  (1, 5)	0.38775666010579296
  (1, 21)	0.38775666010579296
  (1, 8)	0.38775666010579296
  (1, 9)	0.3128396318588854
  (2, 10)	0.5773502691896258
  (2, 22)	0.5773502691896258
  (2, 4)	0.5773502691896258
  (3, 23)	0.4472135954999579
  (3, 18)	0.4472135954999579
  (3, 12)	0.4472135954999579
  (3, 14)	0.4472135954999579
  (3, 7)	0.4472135954999579
  (4, 19)	0.5773502691896258
  (4, 2)	0.5773502691896258
  (4, 1)	0.5773502691896258
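The two printouts agree to about nine decimal places but not exactly. The custom code rounds each IDF value to 8 decimals, so the results differ in the trailing digits, and the sparse entries are also listed in a different order. A strict `==` comparison would therefore report a mismatch. A tolerance-based check is the usual way to compare floating-point results; the sketch below does this for a few values copied from the outputs above (the list names are illustrative).

```python
import math

# A few entries copied from the two printouts above, in matching positions
custom_values = [0.36152911733038906, 0.2916794152081504, 0.3877566601424279]
sklearn_values = [0.36152911730069653, 0.2916794154657719, 0.38775666010579296]

# Exact equality fails because of the rounding in the custom IDF values
exactly_equal = custom_values == sklearn_values

# Comparing with an absolute tolerance treats the results as the same
approximately_equal = all(
    math.isclose(a, b, rel_tol=0.0, abs_tol=1e-8)
    for a, b in zip(custom_values, sklearn_values)
)

print(exactly_equal, approximately_equal)
```

For the full matrices in the notebook, `np.allclose(tfidf_normalized.toarray(), X.toarray())` is one way to run the same kind of check.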
