## Webinar #5: Project 1 steps and tips

The purpose of this notebook is to help you complete your first project successfully. We will first build a basic search tool, which will let you practice a little Python, and then we will go through Project 1 step by step. The mini search tool covers almost all the types of code you will have to write in Project 1.

## Part 1: Warmup: Build a mini search tool

In this warmup section, we will build a mini project that covers all the concepts you will use in the first project.

## Project description

You have a list of files and a sentence, and you would like to know how often each word of the sentence appears in each file.
So our tool will take a sentence as input and output, for each word in the sentence, its count in each of the files.

In [679]:
# folder that contains the files
FOLDER = 'text_files'

### Time module

We use the `time` function from the `time` module to measure how long a code block takes to run. We will use it throughout this notebook.

In [680]:
def convert_time(tot_time):
    """Convert a time in seconds to hours, minutes, and seconds, and print
        the result in the format xh:ymin:zs, where x, y, and z are the hours,
        the minutes, and the seconds respectively.
        
        Args: tot_time (int or float): the time in seconds we want to convert
        
        Return: None
    """
    hours = int((tot_time / 3600))
    minutes = int(((tot_time %3600)/60))
    seconds = int( ( (tot_time % 3600) % 60 ) )
    
    print(f"\nTotal Elapsed Runtime: {str(hours)}h:{str(minutes)}min:{str(seconds)}s")
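The arithmetic in `convert_time` can also be written with `divmod`, which returns a quotient and a remainder in one call. The sketch below is a variant, not a replacement: `format_elapsed` is a hypothetical name, and it returns the string instead of printing it so that the result is easy to check.

```python
def format_elapsed(tot_time):
    """Return tot_time (in seconds) formatted as 'xh:ymin:zs'."""
    # divmod(a, b) returns the pair (a // b, a % b)
    hours, remainder = divmod(int(tot_time), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h:{minutes}min:{seconds}s"

print(format_elapsed(3725))  # 3725 s = 1 h + 2 min + 5 s
```

Returning the string rather than printing it also makes the function reusable inside other messages.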

In [681]:
# Let's list all the files in our directory
from time import time
from os import listdir
from pprint import pprint
start_time = time()
files = listdir(FOLDER)
print('Here are the files inside my directory\n')
pprint(files)
end_time = time()
tot_time = end_time - start_time
print()
print(f'time elapsed: {tot_time*1000: .2f}ms')

Here are the files inside my directory

['.ipynb_checkpoints',
 'fifth_file_05.txt',
 'first_file_01.txt',
 'fourth_file_04.txt',
 'second_file_02.txt',
 'third_file_03.txt']

time elapsed:  2.08ms


In [682]:
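# drop the '.ipynb_checkpoints' entry so that only the text files remain
files = [f for f in files if f.endswith('.txt')]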

In [683]:
files

['fifth_file_05.txt',
 'first_file_01.txt',
 'fourth_file_04.txt',
 'second_file_02.txt',
 'third_file_03.txt']

We have 5 files inside our folder. Now let's look at the text inside each file.

In [684]:
import csv
import os
start_time = time()
for file in files:
    with open(os.path.join(FOLDER, file), 'r') as f:
        # os.path.basename works with any path separator, unlike split("/")
        print(f'text in file: {os.path.basename(f.name)}')
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            print(row)
        print()
end_time = time()
tot_time = end_time - start_time
print()
print(f'time elapsed: {tot_time*1000: .2f}ms')

text in file: fifth_file_05.txt
['With its budget slashed', ' the I.R.S. has pulled back across the board — except for one area where it’s been easier to keep the numbers from falling so much: audits of the poor. More than one-third of all audit targets claim the earned-income tax credit', ' one of the nation’s largest anti-poverty programs. By the hundreds of thousands', ' I.R.S. computers spit out letters that require low-income taxpayers to prove their eligibility. The counties with the highest audit rates aren’t found in the hedge fund precincts of Connecticut or the lobbyist enclaves of Northern Virginia. No', ' they’re rural', ' mostly African-American counties in the Deep South.']

text in file: first_file_01.txt
['Businesses and the wealthy benefit the most from this state of affairs. The largest corporations in America used to be audited every year. That started to change when the cuts began', ' and today', ' the audit rate has fallen by half. It’s a similar story for individu

Now that we know what is inside our files, let's process them.

### Working with dictionaries

In this section we will see different ways to work with dictionaries.

In [685]:
# let's create a dictionary of file_name:text
# {filename: [text], ...}

file_text = {}
for file in files:
    with open(os.path.join(FOLDER, file), 'r') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            file_text[file] = row
        

In [686]:
pprint(file_text)

{'fifth_file_05.txt': ['With its budget slashed',
                       ' the I.R.S. has pulled back across the board — except '
                       'for one area where it’s been easier to keep the '
                       'numbers from falling so much: audits of the poor. More '
                       'than one-third of all audit targets claim the '
                       'earned-income tax credit',
                       ' one of the nation’s largest anti-poverty programs. By '
                       'the hundreds of thousands',
                       ' I.R.S. computers spit out letters that require '
                       'low-income taxpayers to prove their eligibility. The '
                       'counties with the highest audit rates aren’t found in '
                       'the hedge fund precincts of Connecticut or the '
                       'lobbyist enclaves of Northern Virginia. No',
                       ' they’re rural',
                       ' mostly African-Ame

### Accessing a dictionary

In [687]:
# keys
keys = file_text.keys()
print('The keys in my dictionary are:\n')
pprint(list(keys))

The keys in my dictionary are:

['fifth_file_05.txt',
 'first_file_01.txt',
 'fourth_file_04.txt',
 'second_file_02.txt',
 'third_file_03.txt']


In [688]:
# values
values = file_text.values()
print('The values in my dictionary are:\n')
pprint(list(values))

The values in my dictionary are:

[['With its budget slashed',
  ' the I.R.S. has pulled back across the board — except for one area where '
  'it’s been easier to keep the numbers from falling so much: audits of the '
  'poor. More than one-third of all audit targets claim the earned-income tax '
  'credit',
  ' one of the nation’s largest anti-poverty programs. By the hundreds of '
  'thousands',
  ' I.R.S. computers spit out letters that require low-income taxpayers to '
  'prove their eligibility. The counties with the highest audit rates aren’t '
  'found in the hedge fund precincts of Connecticut or the lobbyist enclaves '
  'of Northern Virginia. No',
  ' they’re rural',
  ' mostly African-American counties in the Deep South.'],
 ['Businesses and the wealthy benefit the most from this state of affairs. The '
  'largest corporations in America used to be audited every year. That started '
  'to change when the cuts began',
  ' and today',
  ' the audit rate has fallen by half. It

In [689]:
# print key/value pair
for key, value in file_text.items():
    print(f'filename: {key}')
    pprint(value)
    print()

filename: fifth_file_05.txt
['With its budget slashed',
 ' the I.R.S. has pulled back across the board — except for one area where '
 'it’s been easier to keep the numbers from falling so much: audits of the '
 'poor. More than one-third of all audit targets claim the earned-income tax '
 'credit',
 ' one of the nation’s largest anti-poverty programs. By the hundreds of '
 'thousands',
 ' I.R.S. computers spit out letters that require low-income taxpayers to '
 'prove their eligibility. The counties with the highest audit rates aren’t '
 'found in the hedge fund precincts of Connecticut or the lobbyist enclaves of '
 'Northern Virginia. No',
 ' they’re rural',
 ' mostly African-American counties in the Deep South.']

filename: first_file_01.txt
['Businesses and the wealthy benefit the most from this state of affairs. The '
 'largest corporations in America used to be audited every year. That started '
 'to change when the cuts began',
 ' and today',
 ' the audit rate has fallen by half

Let's create a vocabulary of our documents.

### Create a vocabulary of words in our document

In this section, we will preprocess our documents. 

In [690]:
# collect the text of every file in a list
corpus = list(file_text.values())

In [691]:
pprint(corpus)

[['With its budget slashed',
  ' the I.R.S. has pulled back across the board — except for one area where '
  'it’s been easier to keep the numbers from falling so much: audits of the '
  'poor. More than one-third of all audit targets claim the earned-income tax '
  'credit',
  ' one of the nation’s largest anti-poverty programs. By the hundreds of '
  'thousands',
  ' I.R.S. computers spit out letters that require low-income taxpayers to '
  'prove their eligibility. The counties with the highest audit rates aren’t '
  'found in the hedge fund precincts of Connecticut or the lobbyist enclaves '
  'of Northern Virginia. No',
  ' they’re rural',
  ' mostly African-American counties in the Deep South.'],
 ['Businesses and the wealthy benefit the most from this state of affairs. The '
  'largest corporations in America used to be audited every year. That started '
  'to change when the cuts began',
  ' and today',
  ' the audit rate has fallen by half. It’s a similar story for individuals

In [692]:
# let's flatten our corpus
flatten_corpus = [text for document in corpus for text in document]
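
The nested comprehension reads left to right, like the nested loops it replaces. A small sketch on made-up data (the `nested` list is an assumption, not the real corpus) shows three equivalent ways to flatten one level of nesting:

```python
from itertools import chain

nested = [['a', 'b'], ['c'], ['d', 'e']]

# 1. nested list comprehension: outer loop first, inner loop second
flat_comprehension = [item for sublist in nested for item in sublist]

# 2. the same logic written as explicit loops
flat_loop = []
for sublist in nested:
    for item in sublist:
        flat_loop.append(item)

# 3. chain.from_iterable from the standard library
flat_chain = list(chain.from_iterable(nested))

print(flat_comprehension)
```

All three produce the same list; the comprehension is simply the most compact way to write it.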

In [693]:
pprint(flatten_corpus)

['With its budget slashed',
 ' the I.R.S. has pulled back across the board — except for one area where '
 'it’s been easier to keep the numbers from falling so much: audits of the '
 'poor. More than one-third of all audit targets claim the earned-income tax '
 'credit',
 ' one of the nation’s largest anti-poverty programs. By the hundreds of '
 'thousands',
 ' I.R.S. computers spit out letters that require low-income taxpayers to '
 'prove their eligibility. The counties with the highest audit rates aren’t '
 'found in the hedge fund precincts of Connecticut or the lobbyist enclaves of '
 'Northern Virginia. No',
 ' they’re rural',
 ' mostly African-American counties in the Deep South.',
 'Businesses and the wealthy benefit the most from this state of affairs. The '
 'largest corporations in America used to be audited every year. That started '
 'to change when the cuts began',
 ' and today',
 ' the audit rate has fallen by half. It’s a similar story for individuals '
 'making $10 mil

Now that we have our corpus, we can preprocess it.

In [694]:
# string.punctuation is a string that holds the ASCII punctuation characters
import string
punc = string.punctuation

In [695]:
punc

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [696]:
# remove punctuation
flatten_corpus_without_punc = [s.translate(str.maketrans('', '', string.punctuation)) for s in flatten_corpus]
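
`str.maketrans('', '', chars)` builds a translation table that deletes every character in `chars`. Because `string.punctuation` only contains ASCII characters, typographic characters such as `’` and `—` are left in place. The sketch below compares that approach with one possible alternative based on Unicode categories; the function names and the sample sentence are assumptions made for illustration.

```python
import string
import unicodedata

def remove_ascii_punctuation(text):
    # delete only the characters listed in string.punctuation
    return text.translate(str.maketrans('', '', string.punctuation))

def remove_unicode_punctuation(text):
    # every Unicode punctuation category starts with 'P' (Pd, Po, Pf, ...)
    # symbols such as '$' belong to category 'S' and are kept by this version
    return ''.join(ch for ch in text if not unicodedata.category(ch).startswith('P'))

sample = 'low-income taxpayers — it’s true.'
print(remove_ascii_punctuation(sample))
print(remove_unicode_punctuation(sample))
```

Which version fits depends on the text: the first keeps contractions such as `it’s` intact, the second removes the lone dash that would otherwise end up as a token.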

In [697]:
pprint(flatten_corpus_without_punc)

['With its budget slashed',
 ' the IRS has pulled back across the board — except for one area where it’s '
 'been easier to keep the numbers from falling so much audits of the poor More '
 'than onethird of all audit targets claim the earnedincome tax credit',
 ' one of the nation’s largest antipoverty programs By the hundreds of '
 'thousands',
 ' IRS computers spit out letters that require lowincome taxpayers to prove '
 'their eligibility The counties with the highest audit rates aren’t found in '
 'the hedge fund precincts of Connecticut or the lobbyist enclaves of Northern '
 'Virginia No',
 ' they’re rural',
 ' mostly AfricanAmerican counties in the Deep South',
 'Businesses and the wealthy benefit the most from this state of affairs The '
 'largest corporations in America used to be audited every year That started '
 'to change when the cuts began',
 ' and today',
 ' the audit rate has fallen by half It’s a similar story for individuals '
 'making 10 million or more a year With 

In [698]:
# split by whitespace
flatten_corpus_without_whitespace = [elt.split() for elt in flatten_corpus_without_punc]

In [699]:
print(flatten_corpus_without_whitespace)

[['With', 'its', 'budget', 'slashed'], ['the', 'IRS', 'has', 'pulled', 'back', 'across', 'the', 'board', '—', 'except', 'for', 'one', 'area', 'where', 'it’s', 'been', 'easier', 'to', 'keep', 'the', 'numbers', 'from', 'falling', 'so', 'much', 'audits', 'of', 'the', 'poor', 'More', 'than', 'onethird', 'of', 'all', 'audit', 'targets', 'claim', 'the', 'earnedincome', 'tax', 'credit'], ['one', 'of', 'the', 'nation’s', 'largest', 'antipoverty', 'programs', 'By', 'the', 'hundreds', 'of', 'thousands'], ['IRS', 'computers', 'spit', 'out', 'letters', 'that', 'require', 'lowincome', 'taxpayers', 'to', 'prove', 'their', 'eligibility', 'The', 'counties', 'with', 'the', 'highest', 'audit', 'rates', 'aren’t', 'found', 'in', 'the', 'hedge', 'fund', 'precincts', 'of', 'Connecticut', 'or', 'the', 'lobbyist', 'enclaves', 'of', 'Northern', 'Virginia', 'No'], ['they’re', 'rural'], ['mostly', 'AfricanAmerican', 'counties', 'in', 'the', 'Deep', 'South'], ['Businesses', 'and', 'the', 'wealthy', 'benefit', 'th

In [700]:
# flatten the flatten_corpus_without_whitespace
flatten_whitespace_corpus = [text for document in flatten_corpus_without_whitespace for text in document]

In [701]:
print(flatten_whitespace_corpus)

['With', 'its', 'budget', 'slashed', 'the', 'IRS', 'has', 'pulled', 'back', 'across', 'the', 'board', '—', 'except', 'for', 'one', 'area', 'where', 'it’s', 'been', 'easier', 'to', 'keep', 'the', 'numbers', 'from', 'falling', 'so', 'much', 'audits', 'of', 'the', 'poor', 'More', 'than', 'onethird', 'of', 'all', 'audit', 'targets', 'claim', 'the', 'earnedincome', 'tax', 'credit', 'one', 'of', 'the', 'nation’s', 'largest', 'antipoverty', 'programs', 'By', 'the', 'hundreds', 'of', 'thousands', 'IRS', 'computers', 'spit', 'out', 'letters', 'that', 'require', 'lowincome', 'taxpayers', 'to', 'prove', 'their', 'eligibility', 'The', 'counties', 'with', 'the', 'highest', 'audit', 'rates', 'aren’t', 'found', 'in', 'the', 'hedge', 'fund', 'precincts', 'of', 'Connecticut', 'or', 'the', 'lobbyist', 'enclaves', 'of', 'Northern', 'Virginia', 'No', 'they’re', 'rural', 'mostly', 'AfricanAmerican', 'counties', 'in', 'the', 'Deep', 'South', 'Businesses', 'and', 'the', 'wealthy', 'benefit', 'the', 'most', '

In [702]:
# remove surrounding whitespace and convert the text to lower case
flatten_whitespace_corpus = [text.strip().lower() for text in flatten_whitespace_corpus]

In [703]:
print(flatten_whitespace_corpus)

['with', 'its', 'budget', 'slashed', 'the', 'irs', 'has', 'pulled', 'back', 'across', 'the', 'board', '—', 'except', 'for', 'one', 'area', 'where', 'it’s', 'been', 'easier', 'to', 'keep', 'the', 'numbers', 'from', 'falling', 'so', 'much', 'audits', 'of', 'the', 'poor', 'more', 'than', 'onethird', 'of', 'all', 'audit', 'targets', 'claim', 'the', 'earnedincome', 'tax', 'credit', 'one', 'of', 'the', 'nation’s', 'largest', 'antipoverty', 'programs', 'by', 'the', 'hundreds', 'of', 'thousands', 'irs', 'computers', 'spit', 'out', 'letters', 'that', 'require', 'lowincome', 'taxpayers', 'to', 'prove', 'their', 'eligibility', 'the', 'counties', 'with', 'the', 'highest', 'audit', 'rates', 'aren’t', 'found', 'in', 'the', 'hedge', 'fund', 'precincts', 'of', 'connecticut', 'or', 'the', 'lobbyist', 'enclaves', 'of', 'northern', 'virginia', 'no', 'they’re', 'rural', 'mostly', 'africanamerican', 'counties', 'in', 'the', 'deep', 'south', 'businesses', 'and', 'the', 'wealthy', 'benefit', 'the', 'most', '

In [704]:
vocabulary = sorted(list(set(flatten_whitespace_corpus)))

In [705]:
print(vocabulary)

['10', '90', 'a', 'about', 'accurately', 'across', 'actually', 'admission', 'affairs', 'africanamerican', 'agency', 'all', 'also', 'america', 'and', 'antipoverty', 'appears', 'are', 'area', 'aren’t', 'arguments', 'armed', 'army', 'as', 'assets', 'at', 'audit', 'audited', 'audits', 'back', 'be', 'been', 'began', 'benefit', 'board', 'bottom', 'budget', 'buildings', 'businesses', 'by', 'can', 'certainly', 'chance', 'change', 'check', 'claim', 'clubs', 'complicated', 'computers', 'connecticut', 'continue', 'corporations', 'countering', 'counties', 'credit', 'cuts', 'dates', 'deep', 'difficulty', 'disappear', 'domestic', 'done', 'earnedincome', 'earners', 'easier', 'easily', 'effect', 'eligibility', 'else', 'enclaves', 'enforce', 'enforcement', 'escaping', 'every', 'except', 'fallen', 'falling', 'fixing', 'for', 'found', 'fred', 'from', 'fund', 'game', 'half', 'has', 'have', 'having', 'hedge', 'help', 'hide', 'highest', 'his', 'how', 'hundreds', 'in', 'inadequate', 'income', 'increasing', '

Now that we have a vocabulary of our corpus, let's create dictionaries using dictionary comprehensions.

In [707]:
# utility function that summarises the steps we did earlier
def preprocess_text(text):
    """Remove the punctuation from a string, split it on whitespace,
        and convert every word to lower case.
        Args: text (string): the string to preprocess
        Return: 
            text (list): the list of preprocessed words
    """
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.strip()
    text = text.split()
    text = [s.strip().lower() for s in text]
    return text
    

In [708]:
def preprocess_file(filename):
    """Preprocess the text inside a file.
        Args:
            filename (string): the file to preprocess
        Return:
            result (list): list of the preprocessed words in the file
    """
    result = []
    with open(os.path.join(FOLDER, filename), 'r') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            for elt in row:
                # preprocess_text returns a list of words
                result.extend(preprocess_text(elt))
            
    return result      

In [710]:
# map every word of a file to its position in the vocabulary
def preprocess_index(filename):
    """Preprocess a file and return, for each word in the file, its position
        within the vocabulary.
        Args: filename (string): file to preprocess
        Return:
            A list of integers, the positions of the words within our vocabulary
    """
    return [vocabulary.index(elt) for elt in preprocess_file(filename)]
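
`vocabulary.index(word)` scans the list from the start on every call. One possible alternative, sketched here on a toy vocabulary (the names `toy_vocabulary` and `word_to_index` are assumptions), is to build a word-to-position dictionary once with `enumerate` and then look the words up in it:

```python
toy_vocabulary = sorted(set('the irs audits the poor'.split()))
# toy_vocabulary is ['audits', 'irs', 'poor', 'the']

# dictionary comprehension: word -> position in the sorted vocabulary
word_to_index = {word: position for position, word in enumerate(toy_vocabulary)}

sentence = ['the', 'irs', 'audits']
indexes = [word_to_index[word] for word in sentence]
print(indexes)
```

For a vocabulary of a few hundred words the difference is negligible, but the dictionary lookup stays fast as the vocabulary grows.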

In [713]:
# here let's use a dictionary comprehension
# the goal here is to have the following dictionary
#{ 
#  file1: [list of words in the file],
#  file2: [list of words in the file],
#  file3: [list of words in the file]
#}
# file to vocabulary list
file_to_voc = {filename: preprocess_file(filename) for filename in file_text}

In [714]:
print(file_to_voc)

{'fifth_file_05.txt': ['with', 'its', 'budget', 'slashed', 'the', 'irs', 'has', 'pulled', 'back', 'across', 'the', 'board', '—', 'except', 'for', 'one', 'area', 'where', 'it’s', 'been', 'easier', 'to', 'keep', 'the', 'numbers', 'from', 'falling', 'so', 'much', 'audits', 'of', 'the', 'poor', 'more', 'than', 'onethird', 'of', 'all', 'audit', 'targets', 'claim', 'the', 'earnedincome', 'tax', 'credit', 'one', 'of', 'the', 'nation’s', 'largest', 'antipoverty', 'programs', 'by', 'the', 'hundreds', 'of', 'thousands', 'irs', 'computers', 'spit', 'out', 'letters', 'that', 'require', 'lowincome', 'taxpayers', 'to', 'prove', 'their', 'eligibility', 'the', 'counties', 'with', 'the', 'highest', 'audit', 'rates', 'aren’t', 'found', 'in', 'the', 'hedge', 'fund', 'precincts', 'of', 'connecticut', 'or', 'the', 'lobbyist', 'enclaves', 'of', 'northern', 'virginia', 'no', 'they’re', 'rural', 'mostly', 'africanamerican', 'counties', 'in', 'the', 'deep', 'south'], 'first_file_01.txt': ['businesses', 'and', 

In [715]:
# the goal here is to have the following dictionary
# the index is the position of the word in our vocabulary list
#{ 
#  file1: [list of indexes],
#  file2: [list of indexes],
#  file3: [list of indexes]
#}
# file to index list
file_to_index = {filename: preprocess_index(filename) for filename in file_text}

In [716]:
print(file_to_index)

{'fifth_file_05.txt': [232, 104, 36, 178, 197, 100, 85, 161, 29, 5, 197, 34, 237, 74, 78, 143, 18, 227, 105, 31, 64, 205, 107, 197, 138, 81, 76, 179, 132, 28, 140, 197, 153, 129, 195, 144, 140, 11, 26, 192, 45, 197, 62, 193, 54, 143, 140, 197, 133, 108, 15, 158, 39, 197, 94, 140, 203, 100, 48, 184, 146, 113, 196, 168, 120, 194, 205, 159, 198, 67, 197, 53, 232, 197, 91, 26, 164, 19, 79, 95, 197, 88, 82, 155, 140, 49, 145, 197, 117, 69, 140, 137, 220, 136, 199, 173, 131, 9, 53, 95, 197, 57, 182], 'first_file_01.txt': [38, 14, 197, 223, 33, 197, 130, 81, 200, 186, 140, 8, 197, 108, 51, 95, 13, 218, 205, 30, 27, 73, 235, 196, 185, 205, 43, 226, 197, 55, 32, 14, 206, 197, 26, 163, 85, 75, 39, 84, 105, 2, 176, 188, 78, 99, 123, 0, 126, 145, 129, 2, 235, 232, 213, 197, 42, 140, 72, 100, 175, 197, 216, 17, 132, 112, 116, 205, 119, 25, 197, 83, 140, 26, 172], 'fourth_file_04.txt': [105, 132, 64, 205, 70, 197, 193, 109, 78, 197, 35, 1, 151, 140, 63, 221, 17, 167, 189, 205, 197, 100, 14, 48, 40, 

One quick note about dictionaries in Python: a dictionary raises a `KeyError` when you try to access a key that does not exist.

In [717]:
d = {}
print(d['one'])

KeyError: 'one'

A `defaultdict` is safer: when a key is missing, it creates a default value for it instead of raising an error.

In [719]:
from collections import defaultdict
df_int = defaultdict(int) # here we create a dictionary of values int, it will default to zero
print(df_int)
print(f'default dict int: {df_int["one"]}')
df_float = defaultdict(float)
print(df_float)
print(f'default dict float: {df_float["one"]}')
df_list = defaultdict(list)
print(df_list)
print(f'default dict list: {df_list["one"]}')
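df_custom = defaultdict(lambda: 'hello world')  # any zero-argument callable can supply the default
df_custom['one']  # looking up a missing key also stores the default under that key
print(df_custom)
print(f'default dict custom: {df_custom["one"]}')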

defaultdict(<class 'int'>, {})
default dict int: 0
defaultdict(<class 'float'>, {})
default dict float: 0.0
defaultdict(<class 'list'>, {})
default dict list: []
defaultdict(<function <lambda> at 0x12ddac268>, {'one': 'hello world'})
default dict custom: hello world
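
Two common uses of `defaultdict`, sketched here on made-up words, are counting with `defaultdict(int)` and grouping with `defaultdict(list)`. For a plain dictionary, `dict.get(key, default)` reads a missing key without raising an error and without inserting the key.

```python
from collections import defaultdict

words = ['audit', 'tax', 'audit', 'army', 'tax', 'audit']

# counting: a missing word starts at 0
counts = defaultdict(int)
for word in words:
    counts[word] += 1

# grouping: a missing letter starts with an empty list
by_first_letter = defaultdict(list)
for word in sorted(set(words)):
    by_first_letter[word[0]].append(word)

print(dict(counts))
print(dict(by_first_letter))

# plain dict: get() returns the fallback and leaves the dictionary unchanged
plain = {'audit': 3}
print(plain.get('irs', 0))
```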


We will also use the `Counter` class from the `collections` module.

In [720]:
from collections import Counter

# Counter is used for counting
list_color = 'red red red green green red blue blue green yellow yellow'.split()
print('My colors: ', list_color)
c = Counter(list_color)
print('The most common color: ', c.most_common(1))
print('The two most common color:', c.most_common(2))
print('Color in my list ',list(c))
print('Count of colors in my list', list(c.values()))
print('Color in my list', list(c.keys()))
print('Item in my counter', list(c.items()))

My colors:  ['red', 'red', 'red', 'green', 'green', 'red', 'blue', 'blue', 'green', 'yellow', 'yellow']
The most common color:  [('red', 4)]
The two most common color: [('red', 4), ('green', 3)]
Color in my list  ['red', 'green', 'blue', 'yellow']
Count of colors in my list [4, 3, 2, 2]
Color in my list ['red', 'green', 'blue', 'yellow']
Item in my counter [('red', 4), ('green', 3), ('blue', 2), ('yellow', 2)]


In [721]:
# the goal here is to have the following dictionary
# {
#  word1: [(filename, count_in_filename), (filename, count_in_filename),..],
# word2: [(filename, count_in_filename), (filename, count_in_filename),..],
# ...
#}
dic = defaultdict(list)
# count the words of each file once, instead of once per vocabulary word
file_counts = {filename: Counter(file_to_voc[filename]) for filename in files}
for word in vocabulary:
    for filename in files:
        # a Counter returns 0 for a word it has never seen
        count = file_counts[filename][word]
        # keep only the files in which the word actually appears
        if count > 0:
            dic[word].append((filename, count))
        
        

In [722]:
print(dic)

defaultdict(<class 'list'>, {'10': [('first_file_01.txt', 1)], '90': [('fourth_file_04.txt', 1)], 'a': [('first_file_01.txt', 2), ('fourth_file_04.txt', 1), ('second_file_02.txt', 2), ('third_file_03.txt', 2)], 'about': [('second_file_02.txt', 1), ('third_file_03.txt', 2)], 'accurately': [('fourth_file_04.txt', 1)], 'across': [('fifth_file_05.txt', 1)], 'actually': [('third_file_03.txt', 1)], 'admission': [('third_file_03.txt', 1)], 'affairs': [('first_file_01.txt', 1)], 'africanamerican': [('fifth_file_05.txt', 1)], 'agency': [('third_file_03.txt', 1)], 'all': [('fifth_file_05.txt', 1)], 'also': [('third_file_03.txt', 1)], 'america': [('first_file_01.txt', 1)], 'and': [('first_file_01.txt', 2), ('fourth_file_04.txt', 1), ('second_file_02.txt', 2)], 'antipoverty': [('fifth_file_05.txt', 1)], 'appears': [('third_file_03.txt', 1)], 'are': [('first_file_01.txt', 1), ('fourth_file_04.txt', 1)], 'area': [('fifth_file_05.txt', 1)], 'aren’t': [('fifth_file_05.txt', 1)], 'arguments': [('third_

In [723]:
# the goal here is to have the following dictionary
# {
#  word1: [count_in_file1, count_in_file2, count_in_file3, count_in_file4, count_in_file5],
#  word2: [count_in_file1, count_in_file2, count_in_file3, count_in_file4, count_in_file5],
# ...
#}
init_dic = {word: [0, 0, 0, 0, 0] for word in vocabulary}
for key in vocabulary:
    # `dic` (built above) maps each word to its (filename, count) pairs
    for filename, count in dic[key]:
        # 'fifth_file_05.txt' -> ['fifth', 'file', '05'] -> column index 4
        root, ext = os.path.splitext(filename)
        parts = [part.lower().strip() for part in root.split('_')]
        column = [int(part) - 1 for part in parts if not part.isalpha()][0]
        init_dic[key][column] = count
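
The two cells above can be sketched on a toy corpus. The file names and texts below are made up for illustration; the idea is the same: count the words of each file once with `Counter`, then read one count per file for each vocabulary word.

```python
from collections import Counter

toy_files = {
    'first_file_01.txt': ['the', 'irs', 'the'],
    'second_file_02.txt': ['the', 'army'],
}
toy_vocabulary = sorted({word for words in toy_files.values() for word in words})

# one Counter per file; a Counter returns 0 for a word it has never seen
counters = [Counter(words) for words in toy_files.values()]

# word -> [count in the first file, count in the second file]
word_to_counts = {word: [c[word] for c in counters] for word in toy_vocabulary}
print(word_to_counts)
```

This version relies on the order of the entries in `toy_files` rather than on the number inside the file name, which is a design choice and not what the notebook does.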

In [724]:
print(init_dic)

{'10': [1, 0, 0, 0, 0], '90': [0, 0, 0, 1, 0], 'a': [2, 2, 2, 1, 0], 'about': [0, 1, 2, 0, 0], 'accurately': [0, 0, 0, 1, 0], 'across': [0, 0, 0, 0, 1], 'actually': [0, 0, 1, 0, 0], 'admission': [0, 0, 1, 0, 0], 'affairs': [1, 0, 0, 0, 0], 'africanamerican': [0, 0, 0, 0, 1], 'agency': [0, 0, 1, 0, 0], 'all': [0, 0, 0, 0, 1], 'also': [0, 0, 1, 0, 0], 'america': [1, 0, 0, 0, 0], 'and': [2, 2, 0, 1, 0], 'antipoverty': [0, 0, 0, 0, 1], 'appears': [0, 0, 1, 0, 0], 'are': [1, 0, 0, 1, 0], 'area': [0, 0, 0, 0, 1], 'aren’t': [0, 0, 0, 0, 1], 'arguments': [0, 0, 1, 0, 0], 'armed': [0, 1, 0, 0, 0], 'army': [0, 1, 0, 0, 0], 'as': [0, 0, 1, 0, 0], 'assets': [0, 0, 1, 0, 0], 'at': [1, 0, 1, 1, 0], 'audit': [2, 0, 0, 0, 2], 'audited': [1, 0, 0, 0, 0], 'audits': [0, 0, 0, 0, 1], 'back': [0, 0, 1, 0, 1], 'be': [1, 0, 1, 1, 0], 'been': [0, 0, 0, 0, 1], 'began': [1, 0, 0, 0, 0], 'benefit': [1, 0, 0, 0, 0], 'board': [0, 0, 0, 0, 1], 'bottom': [0, 0, 0, 1, 0], 'budget': [0, 1, 0, 0, 1], 'buildings': [0, 0

In [726]:
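# our search tool
def search_tool(text):
    """Return, for each distinct word in text, its count in each of the five files.
        Args: text (string): the sentence to search for
        Return:
            result (dict): word -> list of counts, one count per file
    """
    result = {}
    text = preprocess_text(text)
    text = list(set(text))
    for elt in text:
        # a word that is not in the vocabulary gets a count of 0 in every file
        result[elt] = init_dic.get(elt, [0, 0, 0, 0, 0])
    return result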

In [727]:
results = search_tool('why are you leaving us')

In [728]:
results

{'are': [1, 0, 0, 1, 0],
 'leaving': [0, 0, 0, 0, 0],
 'us': [0, 0, 0, 0, 0],
 'why': [0, 0, 1, 0, 0],
 'you': [0, 0, 0, 0, 0]}

### Printing statistics and OOP


In this section we will play a little bit with object-oriented programming.

#### Creating an abstract class

An abstract class is a class that cannot be instantiated. Here we will create an abstract class and review inheritance. Our `TableFormatter` is the superclass, and its subclasses output our statistics in different ways: to the console and to a CSV file. You can extend it by adding, for instance, an HTML formatter.
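
To see what "cannot be instantiated" means in practice, here is a minimal sketch with two hypothetical classes, `Shape` and `Square` (they are not part of the project):

```python
from abc import ABC, abstractmethod

class Shape(ABC):
    @abstractmethod
    def area(self):
        """Subclasses must implement this method."""

class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side ** 2

try:
    Shape()  # the abstract method is not implemented, so this raises TypeError
except TypeError as error:
    print(f'TypeError: {error}')

print(Square(3).area())
```

A subclass becomes instantiable only once it implements every abstract method of its parent.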

In [729]:
import sys
from abc import ABC, abstractmethod

# Here is our abstract base class
class TableFormatter(ABC):
    """Our table formatter"""
    def __init__(self, outfile=None):
        # write to the console by default, otherwise to the given file
        if outfile is None:
            self.outfile = sys.stdout
        else:
            self.outfile = open(outfile, 'w')
        
    @abstractmethod
    def headings(self, headers):
        """Abstract method, implemented in the subclasses.
            It writes the headings to the output file.
        """
        pass
    
    @abstractmethod
    def row(self, rowdata):
        """Abstract method, implemented in the subclasses.
            It writes one row of data to the output file.
        """
        pass
    
    def close(self):
        """Close the output file, but leave the console (sys.stdout) open."""
        if self.outfile and self.outfile is not sys.stdout:
            self.outfile.close()
        self.outfile = None

In [730]:
# our print table formatter
# by default, this just prints the data to the console
class PrintTableFormatter(TableFormatter):
    """A print table formatter"""
    def __init__(self, outfile=None, width=10):
        super().__init__(outfile)
        self.width = width
        
    def headings(self, headers):
        """Format the header, and write it on the console
        Args: headers: a list of string 
        Return: None
        """
        for header in headers:
            print(f'{header:>{self.width}}', end=' ', file=self.outfile)
        print(file=self.outfile)
        
    def row(self, rowdata):
        """Format the rowdata and write it on the console
        Args: rowdata: a list of data
        """
        for item in rowdata:
            print(f'{item:>{self.width}}', end=' ', file=self.outfile)
        print(file=self.outfile)

In [731]:
class CSVTableFormatter(TableFormatter):
    """A csv formatter
    """
    def headings(self, headers):
        """format the header in csv and write it in the file
            Args: headers: list of string
            Returns: None
        """
        print(','.join(headers), file=self.outfile)
    
    def row(self, rowdata):
        """Format the row in csv and write it in the file
        Args: rowdata: list of data
        Return: None
        """
        rowdata = [str(elt) for elt in rowdata]
        print(','.join(rowdata), file=self.outfile)
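
An HTML formatter is one possible extension. The sketch below is only an illustration: so that it runs on its own, it repeats a stripped-down base class (`MiniTableFormatter`, a hypothetical stand-in for `TableFormatter`) that takes an already open file-like object instead of a file name, and `HTMLTableFormatter` is an assumed name.

```python
import io
import sys
from abc import ABC, abstractmethod
from html import escape

class MiniTableFormatter(ABC):
    """Stripped-down stand-in for TableFormatter; writes to a file-like object."""
    def __init__(self, outfile=None):
        self.outfile = sys.stdout if outfile is None else outfile

    @abstractmethod
    def headings(self, headers): ...

    @abstractmethod
    def row(self, rowdata): ...

class HTMLTableFormatter(MiniTableFormatter):
    """Write each heading line or data row as one HTML table row."""
    def headings(self, headers):
        cells = ''.join(f'<th>{escape(str(h))}</th>' for h in headers)
        print(f'<tr>{cells}</tr>', file=self.outfile)

    def row(self, rowdata):
        cells = ''.join(f'<td>{escape(str(d))}</td>' for d in rowdata)
        print(f'<tr>{cells}</tr>', file=self.outfile)

buffer = io.StringIO()  # in-memory file, so nothing is written to disk
formatter = HTMLTableFormatter(buffer)
formatter.headings(['word', 'file 1'])
formatter.row(['irs', 1])
print(buffer.getvalue())
```

`html.escape` keeps characters such as `<` and `&` in the data from being read as markup.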


In [732]:
# print table
def print_table(objects, colnames, formatter):
    '''Print the elements in objects using the formatter
    :param objects: The data to print: a dictionary that maps each word to a list of counts
    :param colnames: The column names: list of strings
    :param formatter: The TableFormatter used to write the headings and the rows
    :return: None
    '''
    if not isinstance(formatter, TableFormatter):
        raise TypeError('Expected a Table Formatter')
    
    # Emit header
    formatter.headings(colnames)
    
    # Emit row
    for elt in objects:
        rowdata = [elt]+ objects[elt]
        formatter.row(rowdata)       
        

In [733]:
# let's test it: we will output the count of each word inside each of our files
colnames = ['word', 'file 1', 'file 2', 'file 3', 'file 4', 'file 5']
csv_formatter = CSVTableFormatter('hello.csv')
print_formatter = PrintTableFormatter()
input_string = 'the irs is auditing the bad guys.'
output = search_tool(input_string)
print_table(output, colnames, csv_formatter)
print_table(output, colnames, print_formatter)
print()
pprint(output)
csv_formatter.close()

      word     file 1     file 2     file 3     file 4     file 5 
        is          0          0          1          0          0 
       bad          0          0          0          0          0 
       the          8          8          7          6         12 
       irs          1          1          3          1          2 
      guys          0          0          0          0          0 
  auditing          0          0          0          0          0 

{'auditing': [0, 0, 0, 0, 0],
 'bad': [0, 0, 0, 0, 0],
 'guys': [0, 0, 0, 0, 0],
 'irs': [1, 1, 3, 1, 2],
 'is': [0, 0, 1, 0, 0],
 'the': [8, 8, 7, 6, 12]}


Below is a screenshot of the CSV file we created.

<img src='helo.png'/>

You can extend this by adding more files to make your vocabulary richer, or by printing additional statistics.

## Part 2: Project 1: Steps and tips

After playing a little bit with Python, let's work on the project. This project will be easy if you have rewritten and understood all the Python code from the class and the mini project we just did.

### Step 1:

Read these sections inside the 'Project' module carefully:

- Section 2 (project description)
- Section 3 (project instruction)
- Section 4 (workspace how to)

### Step 2: Time code

Read carefully <b>section 5</b> and implement <b>#TODO 0</b>. You can use the method 'convert_time' we implemented earlier or implement your own.

### Step 3: Command Line Argument
Read carefully <b>section 7</b> and implement <b>#TODO 1</b> inside <b>'get_input_args.py'</b>.
If you get stuck read the file <b>'get_input_args_hints.py'</b>
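
If you have never used `argparse`, here is a minimal sketch of how command line arguments are declared and read. The flag names `--folder` and `--limit` and their defaults are made up for this example; the arguments the project actually expects are described in its instructions.

```python
import argparse

def get_args(argv=None):
    """Parse command line arguments; the flag names here are illustrative only."""
    parser = argparse.ArgumentParser(description='Toy example of argparse')
    parser.add_argument('--folder', type=str, default='text_files',
                        help='folder that contains the input files')
    parser.add_argument('--limit', type=int, default=5,
                        help='maximum number of files to read')
    # with argv=None, argparse reads sys.argv[1:], which is what a script wants
    return parser.parse_args(argv)

# passing a list lets us try the parser without a real command line
args = get_args(['--folder', 'my_files'])
print(args.folder, args.limit)
```

Each argument becomes an attribute of the returned object, and an argument that is not given on the command line takes its default value.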

### Step 4: Creating Pet Image Labels
Read carefully <b>section 10</b> and implement <b>#TODO 2</b> inside <b>get_pet_labels.py</b> and <b>check_images.py</b>. If you get stuck, read <b>'get_pet_labels_hints.py'</b>. You can also be a little bit creative here and try to use a dictionary comprehension.
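
As a reminder of what a dictionary comprehension looks like in this kind of task, here is a small sketch that turns file names into labels. The file names and the helper `label_from_filename` are hypothetical, and the exact dictionary format the project expects is the one described in its instructions, not necessarily this one.

```python
# hypothetical file names; the real ones come from the project's image folder
filenames = ['Boston_terrier_02259.jpg', 'cat_01.jpg', 'Great_dane_00031.jpg']

def label_from_filename(filename):
    """Keep the alphabetic parts of the name, lower-cased and joined by spaces."""
    stem = filename.rsplit('.', 1)[0]  # drop the extension
    words = [part.lower() for part in stem.split('_') if part.isalpha()]
    return ' '.join(words)

# dictionary comprehension: filename -> label
labels = {filename: label_from_filename(filename) for filename in filenames}
print(labels)
```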

### Step 5: Classifying Images
Read carefully <b>section 12</b> and implement <b>#TODO 3</b> inside <b>classify_images.py</b> and <b>check_images.py</b>. If you get stuck, read <b>'classify_images_hints.py'</b>. You can also be a little bit creative here and try to use a list comprehension.

### Step 6: Classifying Labels as Dogs
Read carefully <b>section 14</b> and implement <b>#TODO 4</b> inside <b>check_images.py</b> and <b>adjust_results4_isadog.py</b>. If you get stuck, try to read <b>'adjust_results4_isadog_hints.py'</b>. You can also be a little bit creative here and try to use a list comprehension.

### Step 7: Calculating Results
Read carefully <b>section 16</b> and implement <b>#TODO 5</b> inside <b>check_images.py</b> and <b>calculates_results_stats.py</b>. If you get stuck, try to read the file <b>'calculates_results_stats_hints.py'</b>. You can also be a little bit creative here and try to use a list comprehension.

### Step 8: Printing The Results
Read carefully <b>section 18</b> and implement <b>#TODO 6</b> inside <b>check_images.py</b> and <b>print_results.py</b>. If you get stuck, read <b>'print_results_hints.py'</b>. You can also be a little bit creative here and try to use a list comprehension. 

### Step 9: Classify Uploaded Images
Read carefully <b>section 20</b> in the class and follow the instructions provided.

### Step 10: Results
Read carefully <b>section 22</b> in the class and follow the instructions provided.

### Step 11: Submit your project
Submit your project

## YAY!!!! CONGRATULATIONS!!! YOU MADE IT

Good job, you successfully completed your first project. You can call yourself an expert in python now. You successfully classified images using cutting-edge computer vision model architectures (ResNet, AlexNet, VGG). You also built a mini search engine.
The excitement has just begun: next time we will cover NumPy, pandas, and Matplotlib. We will use them to do some linear algebra, statistics, data manipulation, and beautiful visualizations.
Take a little break, do your favorite thing (maybe kiss or hug your partner/wife/husband)... See Y'all Next Friday :). It's an honor to be at your service.