Now we want to enhance the `get_bow_from_docs` function so that it works with HTML webpages. HTML contains a lot of clutter, such as HTML tags, JavaScript, and [Unicode characters](https://www.w3schools.com/charsets/ref_utf_misc_symbols.asp), that will pollute your bag of words. We need to clean up this junk before generating the BoW.

Next, define several new functions, each of which specializes in cleaning up one aspect of the HTML. For instance, you can have a `strip_html_tags` function to remove all HTML tags, a `remove_punctuation` function to remove all punctuation, a `to_lower_case` function to convert a string to lowercase, and a `remove_unicode` function to remove all non-ASCII (Unicode) characters.

Then, in your `get_bow_from_docs` function, call each of the functions you created to clean up the HTML before you generate the corpus.

Note: Please use only Python string operations and regular expressions in this lab. Do not use extra libraries such as `beautifulsoup`, because otherwise you lose the purpose of practicing.

In [77]:
# Define your string handling functions below
# Minimal 3 functions
import re

def strip_html_tags(text):
    # Replace every <...> tag with a space so adjacent words stay separated
    return re.sub(r'<[^<]+?>', ' ', text)

print(strip_html_tags("Ironhack<div>without<balise>tags<nb>"))

def remove_punctuation(text):
    # Replace anything that is neither a word character nor whitespace
    return re.sub(r'[^\w\s]', ' ', text)

print(remove_punctuation("Ironhack-without:punctuation!"))

def to_lower_case(text):
    return text.lower()

print(to_lower_case("IRONHACK is LowerCase"))

def remove_unicode(text):
    # Replace non-ASCII characters; ASCII control characters such as \x00 are kept
    return re.sub(r'[^\x00-\x7f]', ' ', text)

print(remove_unicode("some\x00string. with\x15 funny characters"))

Ironhack without tags 
Ironhack without punctuation 
ironhack is lowercase
some string. with funny characters


Next, paste your previously written `get_bow_from_docs` function below. Call your functions above at the appropriate place.

In [78]:
def get_bow_from_docs(docs, stop_words=()):
    # In the function, first define the variables you will use such as `corpus`, `bag_of_words`, and `term_freq`.
    corpus = []
    bag_of_words = []
    term_freq = []

    # Read each document and clean it before adding it to the corpus
    for path in docs:
        with open(path) as f:  # the context manager closes the file after reading
            doc = f.read()
        doc = remove_unicode(doc)
        doc = strip_html_tags(doc)
        doc = remove_punctuation(doc)
        corpus.append(to_lower_case(doc))

    # Collect each unique term, in order of first appearance
    for doc in corpus:
        for term in doc.split():
            if term not in bag_of_words:
                bag_of_words.append(term)

    # Drop stop words from the vocabulary
    for word in stop_words:
        if word in bag_of_words:
            bag_of_words.remove(word)

    # Count how often each remaining term occurs in each document
    for doc in corpus:
        tokens = doc.split()
        term_freq.append([tokens.count(term) for term in bag_of_words])

    return {
        "bag_of_words": bag_of_words,
        "term_freq": term_freq
    }
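The list-based version above rescans `bag_of_words` for every token and calls `count` once per term, which gets slow as the vocabulary grows. As a minimal sketch of one alternative, assuming the documents have already been read and cleaned into plain strings, the same result can be built with `collections.Counter`. The name `bow_from_texts` is a hypothetical helper, not part of the lab.

```python
from collections import Counter

def bow_from_texts(texts, stop_words=()):
    """Hypothetical helper: build a bag of words from already-cleaned strings."""
    stop_words = set(stop_words)

    # A dict preserves insertion order, so terms stay in order of first appearance
    vocab = {}
    for text in texts:
        for term in text.split():
            if term not in stop_words:
                vocab.setdefault(term, None)
    bag_of_words = list(vocab)

    # Counter returns 0 for terms that do not occur in a document
    counts = [Counter(text.split()) for text in texts]
    term_freq = [[c[term] for term in bag_of_words] for c in counts]

    return {"bag_of_words": bag_of_words, "term_freq": term_freq}

result = bow_from_texts(["ironhack is cool", "i love ironhack"], stop_words=["is", "i"])
print(result)
```

The returned dictionary has the same two keys as `get_bow_from_docs`, so the rest of the notebook would not need to change if this approach were adopted.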


Next, read the content from the three HTML webpages in the `your-codes` directory to test your function.

In [79]:
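# ENGLISH_STOP_WORDS is importable from sklearn.feature_extraction.text;
# the separate stop_words module is no longer available in recent scikit-learn releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
bow = get_bow_from_docs([
        'www.coursereport.com_ironhack.html',
        'en.wikipedia.org_Data_analysis.html',
        'www.lipsum.com.html'
    ],
    ENGLISH_STOP_WORDS
)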

print(bow)

{'bag_of_words': ['gte', '9', 'ironhack', 'reviews', 'course', 'report', 'try', 'typekit', 'load', 'catch', 'e', 'javascript_include_tag', 'oss', 'maxcdn', 'com', 'libs', 'html5shiv', '3', '7', '0', 'js', 'respond', '1', '4', '2', 'min', 'toggle', 'navigation', 'browse', 'schools', 'stack', 'web', 'development', 'mobile', 'end', 'data', 'science', 'ux', 'design', 'digital', 'marketing', 'product', 'management', 'security', 'blog', 'advice', 'ultimate', 'guide', 'choosing', 'school', 'best', 'coding', 'bootcamps', 'ui', 'cybersecurity', 'write', 'review', 'sign', 'amsterdam', 'barcelona', 'berlin', 'madrid', 'mexico', 'city', 'miami', 'paris', 'sao', 'paulo', 'avg', 'rating', '89', '596', 'courses', 'news', 'contact', 'alex', 'williams', 'week', 'time', '24', 'bootcamp', 'florida', 'spain', 'france', 'germany', 'uses', 'customized', 'approach', 'education', 'allowing', 'students', 'shape', 'experience', 'based', 'personal', 'goals', 'admissions', 'process', 'includes', 'submitting', 'wr

Do you see any problems in the output? How would you improve it?

A good way to improve your code is to look into the HTML data sources and try to understand where the messy output came from. A good data analyst always learns about the data in depth in order to perform the job well.
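Tokens such as `typekit`, `load`, `catch`, and `javascript_include_tag` in the output look like JavaScript rather than page text: stripping tags removes `<script>` and `</script>` themselves but leaves everything between them. As a rough sketch of one possible improvement, using only regular expressions, the helpers below remove script and style blocks, HTML comments, and HTML entities before the tags are stripped. Both function names are hypothetical and the sample string is made up for illustration.

```python
import re

def strip_scripts_and_styles(html):
    """Hypothetical helper: drop <script>/<style> blocks and HTML comments."""
    # Remove the opening tag, the contents, and the matching closing tag
    html = re.sub(r'<(script|style)\b[^>]*>.*?</\1\s*>', ' ', html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Remove HTML comments, which may span several lines
    html = re.sub(r'<!--.*?-->', ' ', html, flags=re.DOTALL)
    return html

def remove_html_entities(text):
    """Hypothetical helper: replace entities such as &nbsp; or &#8217; with a space."""
    return re.sub(r'&(?:#\d+|#x[0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);', ' ', text)

# Made-up sample page for illustration
sample = ('<html><head><style>p {color: red;}</style>'
          '<script>try{Typekit.load();}catch(e){}</script></head>'
          '<body><!-- nav --><p>Data&nbsp;analysis</p></body></html>')

cleaned = remove_html_entities(strip_scripts_and_styles(sample))
# Same tag-stripping pattern as strip_html_tags
tokens = re.sub(r'<[^<]+?>', ' ', cleaned).split()
print(tokens)
```

These steps would need to run before `strip_html_tags`, because once the tags are gone there is no way to tell script code from visible text.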

Spend 20-30 minutes improving your functions, or continue until you feel comfortable with string operations. This lab is just practice, so you don't need to stress yourself out. If you feel you've practiced enough, you can stop and move on to the next challenge question.

In [80]:
import pandas as pd
#df = pd.DataFrame(bow)
len(bow["bag_of_words"])
#len(bow["term_freq"])

4474

In [81]:
for x in range(0,3):
    print(len(bow["term_freq"][x]))

4474
4474
4474


In [82]:
from pprint import pprint 

pprint(bow)

{'bag_of_words': ['gte',
                  '9',
                  'ironhack',
                  'reviews',
                  'course',
                  'report',
                  'try',
                  'typekit',
                  'load',
                  'catch',
                  'e',
                  'javascript_include_tag',
                  'oss',
                  'maxcdn',
                  'com',
                  'libs',
                  'html5shiv',
                  '3',
                  '7',
                  '0',
                  'js',
                  'respond',
                  '1',
                  '4',
                  '2',
                  'min',
                  'toggle',
                  'navigation',
                  'browse',
                  'schools',
                  'stack',
                  'web',
                  'development',
                  'mobile',
                  'end',
                  'data',
                  'science',
  

                  'low',
                  'micro',
                  'models',
                  'principal',
                  'visual',
                  'beasts',
                  'implement',
                  'designs',
                  'marketable',
                  'individual',
                  'breakouts',
                  'push',
                  'trend',
                  'generalist',
                  'larger',
                  'broader',
                  'specialized',
                  'niches',
                  'ideal',
                  'tackled',
                  'groups',
                  'flow',
                  'experts',
                  'sections',
                  'g',
                  'differ',
                  'jump',
                  'shoes',
                  'mix',
                  'include',
                  'sony',
                  'profit',
                  'crack',
                  'sector',
                  'schedule',
         

                  'fugit',
                  'consequuntur',
                  'magni',
                  'dolores',
                  'eos',
                  'ratione',
                  'sequi',
                  'nesciunt',
                  'numquam',
                  'eius',
                  'modi',
                  'tempora',
                  'incidunt',
                  'magnam',
                  'aliquam',
                  'quaerat',
                  'minima',
                  'nostrum',
                  'exercitationem',
                  'ullam',
                  'corporis',
                  'suscipit',
                  'laboriosam',
                  'aliquid',
                  'commodi',
                  'consequatur',
                  'autem',
                  'vel',
                  'eum',
                  'iure',
                  'quam',
                  'nihil',
                  'molestiae',
                  'illum',
                  'quo',
    

                1,
                1,
                3,
                1,
                1,
                4,
                1,
                1,
                1,
                23,
                1,
                2,
                9,
                2,
                10,
                3,
                3,
                2,
                7,
                2,
                1,
                7,
                11,
                6,
                2,
                7,
                5,
                6,
                1,
                23,
                3,
                1,
                5,
                5,
                2,
                2,
                1,
                2,
                3,
                7,
                1,
                7,
                14,
                2,
                1,
                3,
                3,
                1,
                2,
                1,
                2,
                2,
       

                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
            

                0,
                0,
                0,
                0,
                0,
                1,
                0,
                0,
                0,
                0,
                0,
                1,
                16,
                3,
                1,
                0,
                0,
                0,
                0,
                0,
                0,
                5,
                0,
                2,
                0,
                0,
                0,
                0,
                1,
                0,
                0,
                1,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                2,
                0,
                2,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
           

                5,
                0,
                0,
                0,
                3,
                0,
                3,
                0,
                0,
                0,
                0,
                0,
                0,
                1,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                3,
                0,
                2,
                0,
                27,
                0,
                1,
                0,
                0,
                0,
                3,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                1,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                2,
                0,
           

                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
            

                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                1,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
            

                1,
                1,
                1,
                1,
                1,
                1,
                1,
                1,
                1,
                1,
                1,
                1,
                1,
                1,
                1,
                1,
                1,
                1,
                1,
                4,
                1,
                1,
                1,
                9,
                2,
                3,
                1,
                1,
                3,
                1,
                2,
                3,
                1,
                1,
                1,
                1,
                2,
                1,
                3,
                1,
                1,
                1,
                1,
                1,
                2,
                2,
                2,
                1,
                2,
                2,
                2,
                1,
            

In [83]:
d = {}

# One column per document, named after its position in the input list
for i, sublist in enumerate(bow["term_freq"]):
    d["term_freq_" + str(i)] = sublist

d["bag_of_words"] = bow["bag_of_words"]

pprint(d)




{'bag_of_words': ['gte',
                  '9',
                  'ironhack',
                  'reviews',
                  'course',
                  'report',
                  'try',
                  'typekit',
                  'load',
                  'catch',
                  'e',
                  'javascript_include_tag',
                  'oss',
                  'maxcdn',
                  'com',
                  'libs',
                  'html5shiv',
                  '3',
                  '7',
                  '0',
                  'js',
                  'respond',
                  '1',
                  '4',
                  '2',
                  'min',
                  'toggle',
                  'navigation',
                  'browse',
                  'schools',
                  'stack',
                  'web',
                  'development',
                  'mobile',
                  'end',
                  'data',
                  'science',
  

                  'deconstruct',
                  'complex',
                  'problems',
                  'break',
                  'smaller',
                  'modules',
                  'good',
                  'general',
                  'understanding',
                  'various',
                  'languages',
                  'understands',
                  'fundamental',
                  'structure',
                  'possesses',
                  '1000',
                  'monthly',
                  'instalments',
                  '12',
                  '36',
                  'quotanda',
                  'scholarship',
                  'women',
                  'dates',
                  '14',
                  'march',
                  '25',
                  'angularjs',
                  'mongodb',
                  'express',
                  'node',
                  '15',
                  'sorted',
                  'default',
                  'so

                  'nutshell',
                  'peak',
                  'admission',
                  'committee',
                  'attracts',
                  'flight',
                  'attendants',
                  'travelling',
                  'yoginis',
                  'cs',
                  'ivy',
                  'leagues',
                  'democratic',
                  'sorts',
                  'pedigree',
                  'tend',
                  'perform',
                  'necessarily',
                  'sample',
                  'motivates',
                  'happens',
                  'function',
                  'inside',
                  'suggest',
                  'ace',
                  'midst',
                  'materials',
                  'address',
                  'https',
                  'autotelicum',
                  'io',
                  'smooth',
                  'coffeescript',
                  'cats',
                 

                  'isn',
                  'embarrassing',
                  'generators',
                  'predefined',
                  'chunks',
                  'dictionary',
                  '200',
                  'handful',
                  'sentence',
                  'reasonable',
                  'repetition',
                  'characteristic',
                  'paragraphs',
                  'bytes',
                  'lists',
                  'translations',
                  'translate',
                  'foreign',
                  'mock',
                  'banners',
                  'colours',
                  'banner',
                  'sizes',
                  'donating',
                  'hosting',
                  'bandwidth',
                  'donation',
                  'appreciated',
                  'paypal',
                  'thank',
                  'chrome',
                  'firefox',
                  'nodejs',
                  'te

                 10,
                 4,
                 1,
                 1,
                 1,
                 5,
                 2,
                 2,
                 5,
                 7,
                 3,
                 5,
                 5,
                 1,
                 2,
                 6,
                 1,
                 1,
                 2,
                 1,
                 8,
                 4,
                 1,
                 5,
                 7,
                 1,
                 8,
                 2,
                 2,
                 5,
                 2,
                 2,
                 3,
                 1,
                 1,
                 3,
                 3,
                 1,
                 4,
                 10,
                 1,
                 1,
                 3,
                 1,
                 1,
                 2,
                 9,
                 4,
                 1,
                 2

                 2,
                 1,
                 1,
                 1,
                 5,
                 1,
                 1,
                 1,
                 4,
                 5,
                 1,
                 3,
                 4,
                 4,
                 13,
                 4,
                 1,
                 4,
                 1,
                 4,
                 2,
                 5,
                 1,
                 4,
                 1,
                 2,
                 4,
                 1,
                 1,
                 3,
                 2,
                 1,
                 6,
                 1,
                 1,
                 2,
                 2,
                 1,
                 2,
                 1,
                 1,
                 1,
                 2,
                 4,
                 4,
                 4,
                 2,
                 3,
                 2,
                 2,

                 1,
                 0,
                 0,
                 1,
                 6,
                 3,
                 4,
                 1,
                 0,
                 0,
                 4,
                 0,
                 0,
                 0,
                 0,
                 2,
                 1,
                 0,
                 0,
                 0,
                 0,
                 4,
                 3,
                 1,
                 0,
                 0,
                 0,
                 0,
                 4,
                 1,
                 0,
                 1,
                 1,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 1,
                 2,
                 0,
                 0,
                 17,
                 1,
                 3,
                 0,
                 0,

                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 2,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 1,
                 4,
                 0,
                 0,
                 1,
                 0,
                 0,
                 0,
                 0,
                 1,
                 1,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 3,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 7,
                 1,
                 1,
                 0,
                 0,
                 0,


                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 1,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,


                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 1,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 1,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,


                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,
                 0,


In [84]:
import pandas as pd
df = pd.DataFrame(d)

In [85]:
df.sort_values("bag_of_words")

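| | term_freq_0 | term_freq_1 | term_freq_2 | bag_of_words |
|---|---|---|---|---|
| 19 | 21 | 24 | 1 | 0 |
| 134 | 12 | 0 | 0 | 000 |
| 3900 | 0 | 1 | 0 | 01311 |
| 3746 | 0 | 1 | 0 | 03 |
| 3777 | 0 | 1 | 0 | 0332 |
| 3879 | 0 | 1 | 0 | 034003 |
| 3770 | 0 | 1 | 0 | 04 |
| 3878 | 0 | 1 | 0 | 07 |
| 3964 | 0 | 1 | 0 | 09 |
| 22 | 14 | 34 | 8 | 1 |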


In [86]:
df.sort_values("term_freq_0", ascending=False)

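| | term_freq_0 | term_freq_1 | term_freq_2 | bag_of_words |
|---|---|---|---|---|
| 2 | 359 | 0 | 0 | ironhack |
| 80 | 159 | 0 | 0 | bootcamp |
| 4 | 159 | 2 | 0 | course |
| 90 | 158 | 0 | 0 | students |
| 153 | 158 | 29 | 2 | s |
| 51 | 110 | 2 | 0 | coding |
| 92 | 103 | 0 | 0 | experience |
| 78 | 100 | 10 | 0 | time |
| 199 | 99 | 0 | 0 | learn |
| 143 | 91 | 0 | 0 | job |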
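The sorted results still contain tokens that carry little meaning, such as the single letter `s` (probably left behind when the apostrophe in words like "Ironhack's" is replaced by a space) and purely numeric strings like `000`. The sketch below shows one possible way to handle both, under the assumption that numbers and one-character tokens are not wanted in the analysis. `drop_possessives` and `keep_token` are hypothetical names, and the sample sentence is made up.

```python
import re

def drop_possessives(text):
    """Hypothetical helper: turn "ironhack's" into "ironhack"."""
    # Handles the straight and the typographic apostrophe
    return re.sub(r"(\w)['’]s\b", r"\1", text)

def keep_token(token):
    """Hypothetical filter: reject single characters and purely numeric tokens."""
    return len(token) > 1 and not token.isdigit()

# Made-up sample sentence for illustration
text = drop_possessives("Ironhack's 9 week bootcamp costs 000 s")
tokens = [t for t in re.sub(r'[^\w\s]', ' ', text).lower().split() if keep_token(t)]
print(tokens)
```

For this to work on real pages, `drop_possessives` would have to run before `remove_unicode` and `remove_punctuation`, since both of those replace apostrophes with spaces. Whether numbers should be dropped at all depends on the question being asked of the data.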
