Now we want to enhance the `get_bow_from_docs` function so that it works with HTML webpages. HTML contains a lot of clutter, such as HTML tags, JavaScript, and [Unicode characters](https://www.w3schools.com/charsets/ref_utf_misc_symbols.asp), that will pollute your bag of words. We need to clean up this junk before generating the BoW.

Next, define several new functions, each specialized to clean up one aspect of the HTML. For instance, you can have a `strip_html_tags` function to remove all HTML tags, a `remove_punctuation` function to remove all punctuation, a `to_lower_case` function to convert a string to lowercase, and a `remove_unicode` function to remove all Unicode characters.

Then, in your `get_bow_from_docs` function, call each of the functions you created to clean up the HTML before you generate the corpus.

Note: Please use only Python string operations and regular expressions in this lab. Do not use extra libraries such as `beautifulsoup`, because otherwise you lose the purpose of practicing.

In [67]:
# DEPENDENCIES
import re

# Define your string handling functions below
# Minimal 3 functions

def stripHtmlTags(string): # without using HTMLParser library
    # This has to be a regex, and it has to run before punctuation is removed!
    # Anything starting with '<' and running up to the next '>' is replaced by a space.
    # Once this is done, it is safe to call removePunctuation() on the string.
    return re.sub(r'<[^<]+?>', ' ', string)

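def removePunctuation(string):
    # Note: inside the character class, `!--` is a range from '!' to '-',
    # so it also matches " # $ % & ' * + (but not '[' or ']').
    return re.sub(r'[.,;?|()!--<>{}/"=\n:_]', ' ', string)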

def toLowerCase(string):
    return string.lower()

# def removeUnicode(string):

# TEST EACH FUNCTION
# test_string = 'This string? "Well", she said, "It has, all sorts of <characters>'
# print(removePunctuation(test_string))
# print(toLowerCase(removePunctuation(test_string)))
# print(stripHtmlTags(test_string))
test_string = open('www.lipsum.com.html').read()
print(toLowerCase(removePunctuation(stripHtmlTags(test_string))))

        lorem ipsum   all the facts   lipsum generator                                                        googletag cmd push function     googletag display  div gpt ad 1456148316198 0                    1344   1377   1397   1381   1408   1381   1398    shqip         8235   1575   1604   1593   1585   1576   1610   1577    nbsp  nbsp     1041   1098   1083   1075   1072   1088   1089   1082   1080    catal agrave      20013   25991   31616   20307    hrvatski     268 esky   dansk   nederlands   english   eesti   filipino   suomi   fran ccedil ais     4325   4304   4320   4311   4323   4314   4312    deutsch     917   955   955   951   957   953   954   940          8235   1506   1489   1512   1497   1514    nbsp  nbsp     2361   2367   2344   2381   2342   2368    magyar   indonesia   italiano   latviski   lietuvi scaron kai     1084   1072   1082   1077   1076   1086   1085   1089   1082   1080    melayu   norsk   polski   portugu ecirc s   rom acirc na   pycc  1082   1080   1081  
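The `nbsp`, `agrave`, `ccedil` tokens and the runs of numbers in the output above appear to be leftovers of HTML character references such as `&nbsp;`, `&ccedil;`, or `&#1344;`, whose `&`, `#`, and `;` were turned into spaces by `removePunctuation`. A minimal sketch of one way to deal with them, assuming they should simply be dropped before punctuation is removed; `removeHtmlEntities` is a hypothetical helper name, not part of the original notebook:

```python
import re

# Hypothetical helper (an assumption, not from the original lab):
# matches named (&nbsp;), decimal (&#1344;) and hexadecimal (&#x41;) character references.
ENTITY_PATTERN = re.compile(r'&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);')

def removeHtmlEntities(string):
    # Replace each character reference with a single space.
    return ENTITY_PATTERN.sub(' ', string)

sample = 'catal&agrave;&nbsp;&#1344;&#1377; fran&ccedil;ais'
print(removeHtmlEntities(sample))
```

Replacing with a space splits words such as `fran&ccedil;ais` in two; replacing with an empty string would glue them back together but would also merge words separated only by `&nbsp;`. Which trade-off is better depends on the pages being processed.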

  


Next, paste your previously written `get_bow_from_docs` function below. Call your functions above at the appropriate place.

In [68]:
import os

def get_bow_from_docs(docs, stop_words=[]):
    # In the function, first define the variables you will use such as `corpus`, `bag_of_words`, and `term_freq`.
    corpus = []
    bag_of_words = []
    term_freq = []
    
    # write your codes here
    # create corpus by opening and read all the files into one object
    for file in docs:
        with open(file) as f:
            corpus.append(toLowerCase(removePunctuation(stripHtmlTags(f.read())))) # here, the .read() method returns a string, which is then appended to list corpus
#     corpus = ["".join(item.lower().split(".")) for item in corpus]
#     corpus = removePunctuation(corpus)
#     corpus = toLowerCase(corpus)
    
    # assign bag_of_words
    for item in corpus:
        for word in item.split(" "):
            if bag_of_words.count(word) < 1 and str(stop_words).count(word) == 0:
                bag_of_words.append(word)
                
    
    # count the term_freq
    for line in corpus:
        mini_list = []
        for word in bag_of_words:
            mini_list.append(line.split(" ").count(word))
        term_freq.append(mini_list)

    
    return {
        "bag_of_words": bag_of_words,
        "term_freq": term_freq
    }
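One detail worth noting in the function above: `str(stop_words).count(word)` is a substring test on the printed form of the stop word collection, not a membership test. A word such as `min` is therefore dropped whenever a stop word like `mine` is present, and empty strings produced by `split(" ")` are filtered only as a side effect. The following is a minimal sketch of the counting step with exact membership instead, assuming the documents have already been cleaned; `build_bow` is a hypothetical name, and `collections.Counter` is from the standard library:

```python
from collections import Counter

def build_bow(cleaned_docs, stop_words=()):
    """Hypothetical alternative to the counting part of get_bow_from_docs.

    cleaned_docs: list of already cleaned, lowercased strings.
    """
    stop_set = set(stop_words)
    # split() without arguments splits on any whitespace run and never yields ''.
    tokenized = [doc.split() for doc in cleaned_docs]

    # Collect unique non-stop-words in order of first appearance.
    bag_of_words = []
    seen = set()
    for tokens in tokenized:
        for word in tokens:
            if word not in seen and word not in stop_set:
                seen.add(word)
                bag_of_words.append(word)

    # Count each document once, then look up every bag word.
    term_freq = []
    for tokens in tokenized:
        counts = Counter(tokens)
        term_freq.append([counts[word] for word in bag_of_words])

    return {"bag_of_words": bag_of_words, "term_freq": term_freq}

print(build_bow(["ironhack is cool", "i love ironhack"], stop_words=["is", "i"]))
# {'bag_of_words': ['ironhack', 'cool', 'love'], 'term_freq': [[1, 1, 0], [1, 0, 1]]}
```

Using a set for `seen` and one `Counter` per document also avoids re-splitting and re-scanning each document for every word in the bag.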

Next, read the content from the three HTML webpages in the `your-codes` directory to test your function.

In [69]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
bow = get_bow_from_docs([
        'www.coursereport.com_ironhack.html',
        'en.wikipedia.org_Data_analysis.html',
        'www.lipsum.com.html'
    ],
    ENGLISH_STOP_WORDS
)

print(bow)

{'bag_of_words': ['[if', 'gte', '9', ']', 'ironhack', 'reviews', 'course', 'report', 'try', 'typekit', 'load', 'catch', 'javascript', 'include', 'tag', 'maxcdn', 'libs', 'html5shiv', '3', '7', '0', 'js', 'respond', '1', '4', '2', 'toggle', 'navigation', 'browse', 'schools', 'stack', 'web', 'development', 'mobile', 'end', 'data', 'science', 'ux', 'design', 'digital', 'marketing', 'product', 'management', 'security', 'blog', 'advice', 'ultimate', 'guide', 'choosing', 'school', 'best', 'coding', 'bootcamps', 'ui', 'cybersecurity', 'write', 'review', 'sign', 'amsterdam', 'barcelona', 'berlin', 'madrid', 'mexico', 'city', 'miami', 'paris', 'sao', 'paulo', 'avg', 'rating', '89', '596', 'courses', 'news', 'contact', 'alex', 'williams', 'week', '24', 'bootcamp', 'florida', 'spain', 'france', 'germany', 'uses', 'customized', 'approach', 'education', 'allowing', 'students', 'shape', 'experience', 'based', 'personal', 'goals', 'admissions', 'process', 'includes', 'submitting', 'written', 'applica

Do you see any problems in the output? How would you improve the output?

A good way to improve your code is to look into the HTML data sources and try to understand where the messy output came from. A good data analyst always learns about the data in depth in order to perform the job well.
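For example, tokens such as `googletag`, `cmd`, and `push` in the earlier output come from inline JavaScript: removing only the `<script>` tags leaves the script body behind as text. A minimal sketch of a pre-cleaning step that removes whole `<script>` and `<style>` elements and HTML comments, contents included, assuming it runs before the remaining tags are stripped; `stripScriptsAndComments` is a hypothetical helper name:

```python
import re

# Hypothetical pre-cleaning step (an assumption, not from the original lab).
# re.DOTALL lets '.' span newlines; \1 refers back to the opening tag name.
SCRIPT_STYLE = re.compile(r'<(script|style)\b[^>]*>.*?</\1\s*>', re.IGNORECASE | re.DOTALL)
COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)

def stripScriptsAndComments(string):
    # Remove script/style elements first, then whatever comments remain.
    string = SCRIPT_STYLE.sub(' ', string)
    return COMMENT.sub(' ', string)

html = ('<p>Hello</p><script>googletag.cmd.push(function() {});</script>'
        '<!-- note --><style>p {color: red}</style><p>world</p>')
print(stripScriptsAndComments(html))
```

Regular expressions are not a full HTML parser, so unusual markup can still slip through; within the constraints of this lab, the sketch is meant as a practical approximation.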

Spend 20-30 minutes improving your functions, or continue until you feel comfortable with string operations. This lab is just practice, so you don't need to stress yourself out. If you feel you've practiced enough, you can stop and move on to the next challenge question.

Next Steps: I should rewrite `removePunctuation()` so that only whitespace and alphanumeric characters are kept, since the corpus still contains a bunch of special symbols.
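A minimal sketch of that next step, assuming only ASCII letters, digits, and whitespace should survive; `keepAlphanumeric` is a hypothetical name chosen so it does not overwrite the `removePunctuation` defined earlier:

```python
import re

def keepAlphanumeric(string):
    # Replace every run of characters that is not an ASCII letter,
    # a digit, or whitespace with a single space.
    return re.sub(r'[^a-zA-Z0-9\s]+', ' ', string)

test_string = '[if gte IE 9]> "Well", she said!'
print(keepAlphanumeric(test_string).split())
```

A negated character class like this is easier to keep complete than a list of symbols to remove. Note that it also discards accented and non-Latin letters; using `[^\w\s]` instead would keep Unicode letters (and the underscore).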