# COMP-240 Homework 4

## Downloading a dataset and basic data wrangling

In this exercise you will download a dataset from a given URL and perform some basic data wrangling.

The dataset `store_data.csv` captures sales logs in `.csv` format from a department store with multiple branches. A sample of the dataset is shown below:

<pre>
date        store   item    sales
2017-02-21  5       3       26
2015-10-04  7       37      20
2017-03-09  5       22      55
2014-05-17  6       4       35
2017-01-18  4       41      22
2013-09-15  3       48      67
2016-02-15  7       47      10
2015-04-08  2       25      107
2013-01-02  1       4       11
2013-06-14  10      22      9
...
</pre>

As the sample shows, an `item` may appear in multiple entries in the dataset. The dataset can be accessed through the following URL:

`https://storage.googleapis.com/comp240-stores/store_data.csv`

Your goal is to download the dataset and go through it to find the **top sold item by aggregating the daily sales per item**.

### Pure Python and the Requests Library

If you have installed Anaconda but have not yet installed the required libraries, open a terminal (e.g., Command Prompt on Windows) and install them via the conda package manager:

`conda install requests`

Download the dataset using the `requests` library, then go through it to find the **top sold item by aggregating the daily sales per item** without importing any other library, i.e., in pure Python.

In [None]:
import requests

In [None]:
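# your code goes here
url = 'https://storage.googleapis.com/comp240-stores/store_data.csv'
sales_per_item = {}

r = requests.get(url)

if r.ok:
    # Decode the body, skip the header row, and accumulate the sales per item
    data = r.content.decode('utf-8').splitlines()
    for line in data[1:]:
        columns = line.split(',')
        if len(columns) < 4:
            # Skip blank or malformed lines instead of aborting the whole loop
            continue
        item = columns[2]
        sales_per_item[item] = sales_per_item.get(item, 0) + int(columns[3])
else:
    # Report the failure instead of silently discarding the response
    print(f'Download failed with status code {r.status_code}')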

In [None]:
sales_per_item.items()

In [None]:
max(sales_per_item.items(), key=(lambda k: k[1]))

# Other solution
# max_key = max(sales_per_item, key=sales_per_item.get)
# print(f'The top sold item is item nr. {max_key} with {sales_per_item[max_key]} sold units.')
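
As a quick offline sanity check, the same aggregation can be sketched on the ten sample rows shown at the start of this exercise. The `sample_csv` string and the `aggregate_sales` helper are illustrative names, not part of the assignment, and the sample is assumed to be comma-separated like the real file.

```python
# Hypothetical sample: the ten rows from the exercise, written as CSV text.
sample_csv = """date,store,item,sales
2017-02-21,5,3,26
2015-10-04,7,37,20
2017-03-09,5,22,55
2014-05-17,6,4,35
2017-01-18,4,41,22
2013-09-15,3,48,67
2016-02-15,7,47,10
2015-04-08,2,25,107
2013-01-02,1,4,11
2013-06-14,10,22,9
"""


def aggregate_sales(csv_text):
    """Sum the sales column per item, using only built-in Python features."""
    totals = {}
    for line in csv_text.splitlines()[1:]:  # skip the header row
        columns = line.split(',')
        if len(columns) < 4:
            continue  # ignore blank or malformed lines
        item = columns[2]
        totals[item] = totals.get(item, 0) + int(columns[3])
    return totals


totals = aggregate_sales(sample_csv)
top_item = max(totals, key=totals.get)
print(top_item, totals[top_item])  # 25 107
```

Items `22` and `4` each appear twice in the sample, so their totals (64 and 46) show that the per-item accumulation works before the result is trusted on the full file.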

## Regular Expressions

In [None]:
import re

**Ex. 2.1** Write a function that uses a regex to test if a given string simply starts with a '>'.

Examples:

<pre>
symbol_check('>abc') -> True
symbol_check('> abc') -> True
symbol_check('>abc14') -> True
symbol_check('abc') -> False
symbol_check('abc>') -> False
</pre>

In [None]:
def symbol_check(s):
    #your code goes here
    pat = r'^>'
    return re.search(pat, s) is not None

    # pat = r'>'
    # return re.match(pat, s) is not None

In [None]:
print(symbol_check('>abc'))
print(symbol_check('> abc'))
print(symbol_check('>abc14'))
print(symbol_check('abc'))
print(symbol_check('abc>'))
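
The two variants in the solution differ in where they look: `re.search` scans the whole string unless the pattern is anchored with `^`, while `re.match` only tries the start of the string. A minimal sketch of the difference, using the string `'abc>'` from the examples (the variable names are illustrative):

```python
import re

s = 'abc>'

# Unanchored search finds '>' anywhere in the string
found_anywhere = re.search(r'>', s) is not None

# re.match only attempts a match at position 0
found_by_match = re.match(r'>', s) is not None

# An explicit ^ anchor makes re.search behave like re.match here
found_anchored = re.search(r'^>', s) is not None

print(found_anywhere, found_by_match, found_anchored)  # True False False
```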

**Ex. 2.2** Write a function that uses a regex to test if a given string is comprised of only two sequences of lowercase letters joined together with an underscore.

Examples:

<pre>
low_und('abbbbbc_ddddd') -> True
low_und('dkfkf_d') -> True
low_und('w_a') -> True
low_und('aA_a') -> False
low_und('_a') -> False
low_und('aaaa_') -> False
low_und('a9aa_a') -> False
low_und('a9aa _a') -> False
</pre>

In [None]:
def low_und(s):
    #your code goes here
    # fullmatch requires the entire string to match, so 'a9aa_a' is rejected
    pat = r'[a-z]+_[a-z]+'
    return re.fullmatch(pat, s) is not None

In [None]:
print(low_und('abbbbbc_ddddd'))
print(low_und('dkfkf_d'))
print(low_und('w_a'))
print(low_und('aA_a'))
print(low_und('_a'))
print(low_und('aaaa_'))
print(low_und('a9aa_a'))
print(low_und('a9aa _a'))
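
Because the string must consist *only* of the two lowercase sequences, it matters whether the pattern is applied to a substring or to the whole string. The sketch below contrasts `re.search` with `re.fullmatch` on one of the negative examples; the variable names are illustrative.

```python
import re

pat = r'[a-z]+_[a-z]+'
s = 'a9aa_a'

# re.search is satisfied by any matching substring
partial = re.search(pat, s)

# re.fullmatch succeeds only if the entire string matches
whole = re.fullmatch(pat, s)

print(partial.group())  # aa_a
print(whole)            # None
```

The unanchored search accepts the tail `aa_a` and ignores the leading `a9`, which is why a whole-string check needs `re.fullmatch` or `^...$` anchors.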

**Ex. 2.3** Write a function `find_words_with_zz(text)` that uses a regex to find the words in a given text that contain a double 'zz'. The 'zz' must not be at the start or at the end of the word.

Examples:

<pre>
>> find_words_with_zz("""Suddenly in the night, John woke up 
                         craving pizza and a sizzling piece of bacon 
                         topped with a drizzle of bbq sauce!""")
['pizza', 'sizzling', 'drizzle']
</pre>

In [None]:
def find_words_with_zz(text):
    # your code goes here
    # (?!zz) rejects words starting with 'zz'; (?<!zz) rejects words ending with 'zz'
    pat = r'\b(?!zz)\w+zz\w+(?<!zz)\b'
    return re.findall(pat, text)

In [None]:
# The first line adds four test words: two starting with 'zz' and two ending with 'zz'
find_words_with_zz("""Suddenly in the night, John woke up zzippe zzzzippe pizz pizzzz
                      craving pizza and a sizzling piece of bacon 
                      topped with a drizzle of bbq sauce!""")
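
One way to cross-check a lookaround-based pattern is to compute the same result with plain string methods. The sketch below assumes a word is any run of `\w` characters and uses a hypothetical helper name, `find_words_with_zz_plain`; it is meant as a test aid, not as the expected regex solution.

```python
import re


def find_words_with_zz_plain(text):
    """Cross-check: keep words containing 'zz' that neither start nor end with it."""
    words = re.findall(r'\w+', text)  # tokenise into words
    return [
        w for w in words
        if 'zz' in w and not w.startswith('zz') and not w.endswith('zz')
    ]


sample = """Suddenly in the night, John woke up zzippe zzzzippe pizz pizzzz
            craving pizza and a sizzling piece of bacon
            topped with a drizzle of bbq sauce!"""

plain_result = find_words_with_zz_plain(sample)
print(plain_result)  # ['pizza', 'sizzling', 'drizzle']
```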

**Ex. 2.4** Write a function `filter_words_end_ie(word_list)` that uses a regex to filter (select) the words from a given list that end with `ie`. 

Examples:

<pre>
>> filter_words_end_ie(['hello', 'bye', 'goalie', 'newbie', 'zero', 'rieb', 'zombie', 'zoom', 'ierapetra'])
['goalie', 'newbie', 'zombie']
</pre>

In [None]:
def filter_words_end_ie(word_list):
    # your code goes here
    result = []
    pat = r'ie$'
    for word in word_list:
        if re.search(pat, word) is not None:
            result.append(word)
    return result

In [None]:
filter_words_end_ie(['hello', 'bye', 'goalie', 'newbie', 'zero', 'rieb', 'zombie', 'zoom', 'ierapetra'])
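
The same filter can be written more compactly with a precompiled pattern and a list comprehension. This is only a stylistic sketch; `IE_END` and `filter_words_end_ie_short` are illustrative names.

```python
import re

# Compile once so the pattern is not re-parsed for every word
IE_END = re.compile(r'ie$')


def filter_words_end_ie_short(word_list):
    """Return the words that end with 'ie', preserving their order."""
    return [word for word in word_list if IE_END.search(word)]


filtered = filter_words_end_ie_short(
    ['hello', 'bye', 'goalie', 'newbie', 'zero', 'rieb', 'zombie', 'zoom', 'ierapetra']
)
print(filtered)  # ['goalie', 'newbie', 'zombie']
```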

**Ex. 2.5** Write a function `replace_zero(text)` that uses a regex to replace any reference of `zero` or `Zero` with `0`.

Examples:

<pre>
>> replace_zero("""After studying the data, we draw the conclusion that zero evidence has been brought forward for the incident at Zero zero three zone that was full of zeros""")

'After studying the data, we draw the conclusion that 0 evidence has been brought forward for the incident at 0 0 three zone that was full of 0s'
</pre>

In [None]:
def replace_zero(text):
    # your code goes here
    pat = r'Zero|zero'
    return re.sub(pat, '0', text)

In [None]:
replace_zero("""After studying the data, we draw the conclusion that zero evidence has been brought forward for the incident at Zero zero three zone that was full of zeros""")
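
The exercise asks only for `zero` and `Zero`, so it is worth seeing how a case-insensitive flag would change the behaviour. The sketch below compares a strict character-class pattern with `re.IGNORECASE` on a made-up string; the variable names are illustrative.

```python
import re

text = 'Zero zero ZERO zeros'

# Strict: only 'zero' and 'Zero', as the exercise specifies
strict = re.sub(r'[Zz]ero', '0', text)

# Loose: re.IGNORECASE also replaces 'ZERO', which goes beyond the specification
loose = re.sub(r'zero', '0', text, flags=re.IGNORECASE)

print(strict)  # 0 0 ZERO 0s
print(loose)   # 0 0 0 0s
```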

**Ex. 2.6** Write a function `find_phone_numbers(text, code='357')` that uses a regex to find and extract all the Cypriot phone numbers in a given text sequence. A valid Cypriot phone number starts with either `00` or `+`, followed by the country code `357` and then 8 digits.

Examples:

<pre>
>> find_phone_numbers("""dkdkkdldld +35799394903 dkdkfk 0035799802358 fkfkfld.\n dldld;s;sdd 
                         dkdk +30690040404 and +3579933040""")
['+35799394903', '0035799802358']
</pre>

In [None]:
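def find_phone_numbers(text, code='357'):
    # your code goes here
    # The prefix is '+' or '00', then the (escaped) country code and exactly 8 digits
    pat = rf'(?:\+|00){re.escape(code)}\d{{8}}'
    return re.findall(pat, text)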

In [None]:
find_phone_numbers("""dkdkkdldld +35799394903 dkdkfk 0035799802358 fkfkfld.\n dldld;s;sdd 
                      dkdk +30690040404 and +3579933040""")
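
The non-capturing group `(?:...)` matters here because `re.findall` returns the contents of capturing groups whenever the pattern contains any. A short sketch on a made-up string shows the difference; the variable names are illustrative.

```python
import re

text = 'call +35799394903 or 0035799802358'

# With a capturing group, findall returns only what the group matched
capturing = re.findall(r'(\+|00)357\d{8}', text)

# With a non-capturing group, findall returns the full matches
non_capturing = re.findall(r'(?:\+|00)357\d{8}', text)

print(capturing)      # ['+', '00']
print(non_capturing)  # ['+35799394903', '0035799802358']
```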

**Ex. 2.7** Write a function `is_zip_code(zipcode)` that uses a regex to test if a given string is a correct zip code. A correct zip code is exactly 6 characters long: 3 letters followed by 3 digits.

Examples:

<pre>
is_zip_code('ABC123') -> True
is_zip_code('FYA598') -> True
is_zip_code('wyz940') -> True
is_zip_code('AB123') -> False
is_zip_code('ABCD123') -> False
is_zip_code('ABC12') -> False
</pre>

In [None]:
def is_zip_code(zipcode):
    # your code goes here
    # fullmatch checks the whole string, so surrounding text such as 'ABC123 x' is rejected
    pat = r'[a-zA-Z]{3}\d{3}'
    return re.fullmatch(pat, zipcode) is not None

In [None]:
print(is_zip_code('ABC123'))
print(is_zip_code('FYA598'))
print(is_zip_code('wyz940'))
print(is_zip_code('AB123'))
print(is_zip_code('ABCD123'))
print(is_zip_code('ABC12'))

**Ex. 2.8** Write a function `is_timestamp(tstamp)` that uses a regex to test if a given string evaluates to a correct 24-hour timestamp that adopts the HH:MM:SS format.

Examples:

<pre>
is_timestamp('00:00:00') -> True
is_timestamp('01:59:47') -> True
is_timestamp('23:59:01') -> True
is_timestamp('11:32:05') -> True
is_timestamp('24:00:00') -> False
is_timestamp('09:62:01') -> False
is_timestamp('9:05:01') -> False
</pre>

In [None]:
def is_timestamp(tstamp):
    # your code goes here
    # Hours 00-23, minutes and seconds 00-59; fullmatch checks the whole string
    pat = r'([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]'
    return re.fullmatch(pat, tstamp) is not None

In [None]:
print(is_timestamp('00:00:00'))
print(is_timestamp('01:59:47'))
print(is_timestamp('23:59:01'))
print(is_timestamp('11:32:05'))
print(is_timestamp('24:00:00'))
print(is_timestamp('09:62:01'))
print(is_timestamp('9:05:01'))
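
The listed examples can also be turned into an automated check instead of being compared by eye. The sketch below assumes a whole-string variant of the timestamp pattern built on `re.fullmatch`; the `cases` dictionary and the `failures` list are illustrative names.

```python
import re


def is_timestamp(tstamp):
    """Whole-string check for a 24-hour HH:MM:SS timestamp (sketch)."""
    pat = r'([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]'
    return re.fullmatch(pat, tstamp) is not None


# Expected results, copied from the examples in the exercise
cases = {
    '00:00:00': True,
    '01:59:47': True,
    '23:59:01': True,
    '11:32:05': True,
    '24:00:00': False,
    '09:62:01': False,
    '9:05:01': False,
}

# Collect every input whose result differs from the expectation
failures = [(s, expected) for s, expected in cases.items() if is_timestamp(s) != expected]
print(failures)  # []
```

The same table-driven pattern works for the other exercises by swapping in the function and its expected values.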