### EXTRACTING BAG OF WORDS (BoW) FEATURES 

In this notebook, I extract bag of words (BoW) features from course titles and descriptions. A BoW feature represents a text by the counts of the tokens it contains, ignoring word order. It is simple but effective and is widely used in text-based machine learning tasks.
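To make the idea concrete, here is a minimal, dependency-free sketch of a bag of words for one short text. The sample sentence and the whitespace tokenizer are illustrative assumptions and are not part of this notebook's pipeline.

```python
from collections import Counter

# Hypothetical sample text; whitespace splitting stands in for a real tokenizer
text = "machine learning with python and machine learning with r"
bow = Counter(text.split())

# Word order is discarded; only per-token counts remain
print(bow["machine"], bow["learning"], bow["python"])
```

The rest of the notebook follows the same idea, but uses NLTK for tokenization and gensim for the token-to-count bookkeeping.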

----


Installing required libraries


In [2]:
!pip install nltk==3.6.7
!pip install gensim==4.1.2

Collecting nltk==3.6.7
  Obtaining dependency information for nltk==3.6.7 from https://files.pythonhosted.org/packages/c5/ea/84c7247f5c96c5a1b619fe822fb44052081ccfbe487a49d4c888306adec7/nltk-3.6.7-py3-none-any.whl.metadata
  Downloading nltk-3.6.7-py3-none-any.whl.metadata (2.8 kB)
Downloading nltk-3.6.7-py3-none-any.whl (1.5 MB)

  error: subprocess-exited-with-error
  
  python setup.py bdist_wheel did not run successfully.
  exit code: 1
  
  [726 lines of output]
  C:\Users\marumom\AppData\Local\anaconda3\Lib\site-packages\setuptools\__init__.py:84: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated.
  !!
  
          ********************************************************************************
          Requirements should be satisfied by a PEP 517 installer.
          If you are using pip, you can try `pip install --use-pep517`.
          ********************************************************************************
  
  !!
    dist.fetch_build_eggs(dist.setup_requires)
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-cpython-311
  creating build\lib.win-amd64-cpython-311\gensim
  copying gensim\downloader.py -> build\lib.win-amd64-cpython-311\gensim
  copying gensim\interfaces.py -> build\lib.win-amd64-cpython-311\gen

In [3]:
import gensim
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim import corpora

%matplotlib inline

Download the NLTK tokenizer, stop word, and POS tagger data


In [4]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\marumom\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\marumom\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\marumom\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [5]:
# also set a random state
rs = 123

### Extracting BoW features from course textual content and building a dataset


In [7]:
course_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/course_processed.csv"
course_content_df = pd.read_csv(course_url)

In [21]:
course_content_df.iloc[0, :]

COURSE_ID                                               ML0201EN
TITLE          robots are coming  build iot apps with watson ...
DESCRIPTION    have fun with iot and learn along the way  if ...
Name: 0, dtype: object

The course content dataset has three columns: `COURSE_ID`, `TITLE`, and `DESCRIPTION`. `TITLE` and `DESCRIPTION` are both text columns from which we want to extract BoW features.


Then we join the text columns together.


In [22]:
# Merge TITLE and DESCRIPTION into a single text column
course_content_df['course_texts'] = course_content_df[['TITLE', 'DESCRIPTION']].agg(' '.join, axis=1)
# Expose the row position as an explicit 'index' column
course_content_df = course_content_df.reset_index()
course_content_df['index'] = course_content_df.index

In [23]:
course_content_df.iloc[0, :]

index                                                           0
COURSE_ID                                                ML0201EN
TITLE           robots are coming  build iot apps with watson ...
DESCRIPTION     have fun with iot and learn along the way  if ...
course_texts    robots are coming  build iot apps with watson ...
Name: 0, dtype: object

Next, we use the provided `tokenize_course()` function to tokenize the course content:


In [24]:
def tokenize_course(course, keep_only_nouns=True):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(course)
    # Remove English stop words and numbers
    word_tokens = [w for w in word_tokens if w.lower() not in stop_words and not w.isnumeric()]
    # Despite the flag's name, this does not keep nouns only: it drops tokens whose
    # POS tag is in filter_list (wh-words, prepositions, adverbs, modals, pronouns, ...)
    if keep_only_nouns:
        filter_list = ['WDT', 'WP', 'WRB', 'FW', 'IN', 'JJR', 'JJS', 'MD', 'PDT', 'POS', 'PRP', 'RB', 'RBR', 'RBS',
                       'RP']
        tags = nltk.pos_tag(word_tokens)
        word_tokens = [word for word, pos in tags if pos not in filter_list]

    return word_tokens

In [25]:
a_course = course_content_df['course_texts'].iloc[2]
a_course

'consuming restful services using the reactive jax rs client learn how to use a reactive jax rs client to asynchronously invoke restful microservices over http '

In [26]:
tokenize_course(a_course)

['consuming',
 'restful',
 'services',
 'using',
 'reactive',
 'jax',
 'rs',
 'client',
 'learn',
 'use',
 'reactive',
 'jax',
 'rs',
 'client',
 'invoke',
 'restful',
 'microservices',
 'http']
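The stop word and number filtering step can be illustrated without NLTK. The sketch below is a simplified stand-in, not the notebook's function: `STOP_WORDS` is a tiny hypothetical set (NLTK's English list is much longer), `simple_filter` is a made-up helper name, and whitespace splitting replaces `word_tokenize`. It also skips the POS-tag filter.

```python
# Hypothetical, tiny stop-word set; the notebook uses NLTK's full English list
STOP_WORDS = {"the", "a", "to", "how", "over", "in"}

def simple_filter(text):
    """Split on whitespace, then drop stop words and purely numeric tokens."""
    return [w for w in text.split()
            if w.lower() not in STOP_WORDS and not w.isnumeric()]

filtered = simple_filter("learn how to use the jax rs client in 2021")
print(filtered)  # ['learn', 'use', 'jax', 'rs', 'client']
```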

_Use the provided `tokenize_course()` function to tokenize all courses in `courses_df['course_texts']`._


In [27]:
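# explode() leaves plain string values unchanged; ignore_index=True resets the row index
courses_df = course_content_df.explode("course_texts", ignore_index=True)
all_courses = courses_df['course_texts'].astype('string')
# Tokenize every course text
all_tokenized_courses = all_courses.apply(tokenize_course)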

In [34]:
all_tokenized_courses

0      [robots, coming, build, iot, apps, watson, swi...
1      [accelerating, deep, learning, gpu, training, ...
2      [consuming, restful, services, using, reactive...
3      [analyzing, big, data, r, using, apache, spark...
4      [containerizing, packaging, running, spring, b...
                             ...                        
302    [javascript, jquery, json, course, look, javas...
303    [programming, foundations, javascript, html, c...
304    [front, end, web, development, react, course, ...
305    [introduction, web, development, course, desig...
306    [interactivity, javascript, jquery, course, th...
Name: course_texts, Length: 307, dtype: object

In [32]:
courses_df

| | index | COURSE_ID | TITLE | DESCRIPTION | course_texts |
|---|---|---|---|---|---|
| 0 | 0 | ML0201EN | robots are coming build iot apps with watson ... | have fun with iot and learn along the way if ... | robots are coming build iot apps with watson ... |
| 1 | 1 | ML0122EN | accelerating deep learning with gpu | training complex deep learning models with lar... | accelerating deep learning with gpu training c... |
| 2 | 2 | GPXX0ZG0EN | consuming restful services using the reactive ... | learn how to use a reactive jax rs client to a... | consuming restful services using the reactive ... |
| 3 | 3 | RP0105EN | analyzing big data in r using apache spark | apache spark is a popular cluster computing fr... | analyzing big data in r using apache spark apa... |
| 4 | 4 | GPXX0Z2PEN | containerizing packaging and running a sprin... | learn how to containerize package and run a ... | containerizing packaging and running a sprin... |
| ... | ... | ... | ... | ... | ... |
| 302 | 302 | excourse89 | javascript jquery and json | in this course we ll look at the javascript l... | javascript jquery and json in this course w... |
| 303 | 303 | excourse90 | programming foundations with javascript html ... | learn foundational programming concepts e g ... | programming foundations with javascript html ... |
| 304 | 304 | excourse91 | front end web development with react | this course explores javascript based front en... | front end web development with react this cour... |
| 305 | 305 | excourse92 | introduction to web development | this course is designed to start you on a path... | introduction to web development this course is... |


Then we need to create a token dictionary (named `all_tokens_dict` in the cell below), which maps each unique token to an integer id.


_Use `gensim.corpora.Dictionary(tokenized_courses)` to create a token dictionary._


In [39]:
# Build the token dictionary; its tokens fill the `token` column of the final dataset
all_tokens_dict = gensim.corpora.Dictionary(all_tokenized_courses)
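Conceptually, a token dictionary is a mapping from each unique token to an integer id. The following plain-Python sketch shows that idea on toy data. The documents and the name `token2id` are assumptions for illustration, and the ids gensim actually assigns may differ from the first-seen order used here.

```python
# Toy tokenized documents (hypothetical data)
docs = [["deep", "learning", "gpu"], ["learning", "spark", "data"]]

token2id = {}
for doc in docs:
    for token in doc:
        # Assign the next free id the first time a token is seen
        token2id.setdefault(token, len(token2id))

print(token2id)
```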


Then we can use the dictionary's `doc2bow()` method to generate BoW features for each tokenized course. For each course it returns a list of `(token_id, count)` pairs.


_Use `tokens_dict.doc2bow()` to generate BoW features for each tokenized course._


In [41]:
# Generate BoW features for each course
# Each entry is a list of (token_id, count) pairs; the counts fill the `bow` column of the final dataset
courses_bow = [all_tokens_dict.doc2bow(course) for course in all_tokenized_courses]
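To show what a `doc2bow()`-style conversion produces, here is a small stand-in written with the standard library only. It is a sketch of the idea, not gensim's implementation: `simple_doc2bow` and the `token2id` mapping are hypothetical, and tokens missing from the mapping are skipped.

```python
from collections import Counter

# Hypothetical token-to-id mapping, standing in for the gensim dictionary
token2id = {"client": 0, "jax": 1, "reactive": 2, "rs": 3}

def simple_doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

bow = simple_doc2bow(["reactive", "jax", "rs", "client", "reactive", "http"], token2id)
print(bow)  # [(0, 1), (1, 1), (2, 2), (3, 1)]
```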

_Then we create a new course_bow dataframe based on the extracted BoW features._


In [42]:


bow_docs = courses_bow.copy()

# Template for building the BoW dataset.
# Target layout (image): https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_2/images/bow_dataset.png
#  ...
#  bow_dicts = {"doc_index": doc_indices,
#            "doc_id": doc_ids,
#            "token": tokens,
#            "bow": bow_values}
#  pd.DataFrame(bow_dicts)
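One possible way to fill in the four lists in the template is sketched below on toy data. This is an assumption about the intended step, not the notebook author's solution: `course_ids`, `id2token`, and the sample BoW lists are hypothetical stand-ins for `courses_df['COURSE_ID']`, the gensim dictionary, and `courses_bow`.

```python
# Toy inputs (hypothetical): two courses, their BoW lists, and an id-to-token map
course_ids = ["C1", "C2"]
bow_docs = [[(0, 2), (1, 1)], [(1, 3)]]
id2token = {0: "learning", 1: "python"}

# Flatten the nested BoW lists into one entry per (course, token) pair
doc_indices, doc_ids, tokens, bow_values = [], [], [], []
for doc_index, doc_bow in enumerate(bow_docs):
    for token_id, count in doc_bow:
        doc_indices.append(doc_index)
        doc_ids.append(course_ids[doc_index])
        tokens.append(id2token[token_id])
        bow_values.append(count)

bow_dicts = {"doc_index": doc_indices, "doc_id": doc_ids,
             "token": tokens, "bow": bow_values}
print(bow_dicts)
```

With the real data, passing `bow_dicts` to `pd.DataFrame` would give one row per course and token, and the token lookup would presumably go through `all_tokens_dict[token_id]`.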