# ENSF 544 Data Science for Software Engineers
**Assignment 1** - 100 marks

**Due:** September 20th, 12pm.


**IMPORTANT NOTE: each task must be implemented as asked, even if there are other easier or better solutions.**

**How to deliver:**
Edit this file and write your solutions in the sections marked with `# Your solution`. Test your code and, when you are done, submit this notebook as an `.ipynb` file to the D2L dropbox.



## Problem 1 - The Zipf mystery (50 points)

In this problem, we'd like to read the text of a book and perform some simple statistical analysis on the word counts. We have provided you with the text of the book [Lost On The Moon or, In Quest of the Field of Diamonds](https://www.goodreads.com/book/show/8636132-lost-on-the-moon-or-in-quest-of-the-field-of-diamonds) in a file named 'the book.txt'. The file is cleaned up and only contains alphanumeric characters, i.e. no punctuation, quotation marks, etc.

Read the file and break it down to its words. (5 points)

In [76]:
def read_and_tokenize(file_name):
    with open(file_name, 'r') as opened_file: # the file is closed automatically on exit
        return opened_file.read().split() # read everything, then split it on whitespace

words = read_and_tokenize('the book.txt')
words[1101:1111] # Expected: ['the', 'latter', 'picked', 'it', 'up', 'gazed', 'at', 'it', 'first', 'from']

['the', 'latter', 'picked', 'it', 'up', 'gazed', 'at', 'it', 'first', 'from']

Create a sorted list of the unique words in the book and store it in a variable called `V`. Also complete the `get_word_index` function below, which takes a word and finds its index within `V`. (5 points)

In [77]:
words = read_and_tokenize('the book.txt') # make the word list first
V = sorted(set(words)) # a set drops the duplicates; sorted() returns them as an ordered list

def get_word_index(word): 
    return V.index(word) # position of the word within the sorted vocabulary

get_word_index('about')  # Expected: 9

9

Using no loops, and only the built-in Python functions `map` and `filter`, traverse the `V` (vocabulary) list above to find:

* `long_words`: The list of words that have 10 letters or more 
* `no_vowels`: A list of all words but with vowels (aoeiu) removed. You can nest `map` and `filter` calls to iterate through the characters of the words.

(5+5 points)
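
As a hedged illustration of nesting `map` and `filter`, the sketch below runs both on a small made-up word list. The names `sample_words`, `long_sample`, and `no_vowel_sample` are hypothetical and the words are not taken from the book.

```python
# Toy vocabulary; hypothetical data used only for illustration.
sample_words = ['atmosphere', 'moon', 'projectile', 'sky']

# filter keeps the items for which the predicate returns True.
long_sample = list(filter(lambda word: len(word) >= 10, sample_words))

# The inner filter walks the characters of one word;
# the outer map applies that step to every word in the list.
no_vowel_sample = list(map(
    lambda word: ''.join(filter(lambda ch: ch not in 'aoeiu', word)),
    sample_words,
))

print(long_sample)      # ['atmosphere', 'projectile']
print(no_vowel_sample)  # ['tmsphr', 'mn', 'prjctl', 'sky']
```

Both `map` and `filter` return lazy iterators in Python 3, which is why the results are wrapped in `list` before printing.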

In [78]:
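def consonants_only(letter):
    return letter not in 'aoeiu' # True for every character that is not a vowel

# the predicate must be a function applied to each word, not a value computed once
long_words = list(filter(lambda word: len(word) >= 10, V))
# the inner filter drops the vowels of one word; the outer map applies it to every word
no_vowels = list(map(lambda word: ''.join(filter(consonants_only, word)), V))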

Create a Python dictionary named `frequencies` (using `defaultdict` would make things easier). Use this dictionary to count the number of times each word appears in the book. For example, `frequencies['about']` should store how many times the word "about" appears in the book (165 times). (10 points)
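
A minimal sketch of the counting pattern on a made-up token list, assuming `defaultdict(int)` as the container. The names `sample_tokens` and `sample_counts` are hypothetical.

```python
from collections import defaultdict

# Hypothetical token list used only for illustration.
sample_tokens = ['the', 'moon', 'the', 'field', 'the', 'moon']

# defaultdict(int) supplies 0 for a missing key, so no existence check is needed.
sample_counts = defaultdict(int)
for token in sample_tokens:
    sample_counts[token] += 1

print(dict(sample_counts))  # {'the': 3, 'moon': 2, 'field': 1}
```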


In [79]:
# Your solution
from collections import defaultdict

frequencies = defaultdict(int) # int() returns 0, so unseen words start at zero
for i in words:
    frequencies[i] += 1

frequencies['about'] # Expected: 165.0

165

Find the word that appears most frequently in the book, along with the number of times it appears. Use Python's built-in `max()` function with a key defined by a lambda function, i.e. do not iterate over the `frequencies` dictionary manually using a `for` loop. (5 points)
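
For illustration, here is how `max` behaves with a lambda key on a small hypothetical dictionary (`sample_counts` and `top_word` are made-up names):

```python
# Hypothetical counts used only for illustration.
sample_counts = {'the': 3, 'moon': 2, 'field': 1}

# Iterating a dict yields its keys; the lambda tells max to compare them by count.
top_word = max(sample_counts, key=lambda word: sample_counts[word])

print(top_word, sample_counts[top_word])  # the 3
```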

In [80]:
# Your solution 
most_common_word = max(frequencies, key=lambda word: frequencies[word]) # compare words by their count
max_frequency = frequencies[most_common_word]

print(f'"{most_common_word}" is the most common word which has appeared {max_frequency} times in the book.')
# Expected: "the" is the most common word which has appeared 3237 times in this book.

"the" is the most common word which has appeared 3237 times in the book.


Normalize all frequency values by dividing them by the maximum frequency value (using map). After this the most common word in the book should get a normalized frequency of `1` and all other words get some value 
between `1/MAX` and `1`. (2.5 points)
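
A small sketch of the normalization step on hypothetical counts. The names `sample_counts`, `sample_max`, and `sample_normalized` are assumptions for illustration only.

```python
# Hypothetical counts used only for illustration.
sample_counts = {'the': 4, 'moon': 2, 'field': 1}
sample_max = max(sample_counts.values())

# map over the counts (the dictionary values), dividing each by the maximum.
sample_normalized = list(map(lambda count: count / sample_max, sample_counts.values()))

print(sample_normalized)  # [1.0, 0.5, 0.25]
```

The smallest value here is `1/4`, which is `1/MAX` for this toy data.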

In [81]:
# Your solution

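def normalize(count):
    return count / max_frequency # the most common word maps to 1

# map over the counts (the dictionary values), not over the words themselves
normalized_frequencies = list(map(normalize, frequencies.values()))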

We want to check whether the normalized frequencies correlate with their ranks. If such a correlation exists, Zipf's law states that it is linear in log-log space. Take the logarithm of the normalized frequencies (as x values) and create a list of the same size containing the rank of each word (as y values). For example, if the frequencies are `[0.1, 0.1, 1, 0.01, 0.0001]`, the x and y values will be `Y = [2, 2, 1, 4, 5]` and `X = [-1, -1, 0, -2, -4]`.
(Note that equal normalized values should have the same rank.)

You might want to sort the normalized frequencies first to make the task easier.  (2.5 points)
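
The worked example above can be reproduced with the sketch below. It assumes base-10 logarithms (which is what the example values imply) and gives tied values the rank of their first position in descending order. The helper names are hypothetical.

```python
from math import log10

sample = [0.1, 0.1, 1, 0.01, 0.0001]

# Rank 1 goes to the largest value; ties share the rank of their first
# position in the descending order, so the next distinct value skips ahead.
descending = sorted(sample, reverse=True)
rank_of = {}
for position, value in enumerate(descending, start=1):
    rank_of.setdefault(value, position)

Y = [rank_of[value] for value in sample]
X = [log10(value) for value in sample]

print(Y)                         # [2, 2, 1, 4, 5]
print([round(v, 6) for v in X])  # [-1.0, -1.0, 0.0, -2.0, -4.0]
```

The Pearson coefficient does not depend on the base of the logarithm, so natural logarithms would lead to the same correlation.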

In [124]:
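from math import log10

# list.sort() sorts in place and returns None, so use sorted() to get a new list
sorted_frequencies = sorted(normalized_frequencies, reverse=True)
first_position = {} # rank of a value = first position at which it appears
for position, value in enumerate(sorted_frequencies, start=1):
    first_position.setdefault(value, position)

x = [log10(value) for value in sorted_frequencies] # base 10, as in the example above
y = [first_position[value] for value in sorted_frequencies] # equal values share a rank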

Calculate the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) on this data. The result is expected to be close to -1. Define the missing variables in the `pcc` function for the statistical calculations as necessary (using Python's built-in functions like `map` and `sum`).  (2.5+2.5+5 points)
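
For reference, the `num` and `den` lines of the provided `pcc` skeleton correspond to the single-pass form of the coefficient, written out below. Here the sum of products is what the code calls `psum`, and the sums of squares are `sum_x_sq` and `sum_y_sq`.

```latex
r = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}
         {\sqrt{\left(\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right)
                \left(\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2\right)}}
```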

In [None]:
def pcc(x, y):
  n = len(x)
  sum_x = float(sum(x))
  sum_y = float(sum(y))

  ### Your solution goes here ###
    
  sum_x_sq = sum(map(lambda v: v ** 2, x)) # sum of squared x
  sum_y_sq = sum(map(lambda v: v ** 2, y)) # sum of squared y
  psum = sum(map(lambda a, b: a * b, x, y)) # dot product of x and y using map
  
  #### your solution ends here### 
  num = psum - (sum_x * sum_y/n)
  den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
  if den == 0: return 0
  return num / den
    
pcc(x, y)

Additionally, you can use the `pearsonr` function from the `scipy` package to verify the calculated value. If you already get a value close to -1, your implementation is very likely correct and this step is not necessary.

In [None]:
#run this cell to see if your pcc function is correct
import numpy as np
from scipy.stats import pearsonr

x = np.log(np.sort(normalized_frequencies)[::-1])
y = np.arange(1, 1+len(V))

pearsonr(x,y)

## Problem 2 - Log processing (50 points)

In this part of the assignment we are going to use regular expressions to mine data out of some web server log files. Although these problems can be solved without regular expressions, for this assignment you need to use them.

A sample web server log file is provided along with this problem. Each line of the file records one event. For simplicity, all of the events in this file have the same format and are of the same type. Each event contains an IP address, the date and time of the event, the HTTP method (`GET` or `POST`), a URL, the HTTP version, the HTTP response code (usually 200), the response size in bytes, and the device's user agent, which contains information about the device such as the brand and the operating system.

Since these logs have such a well-defined format, regular expressions are the perfect tool for breaking them down into parts and performing different analyses on them.

**Please make sure that when you are asked to write a function that _return_s something, you are _return_ing that value, not just _print_ing it**
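
To illustrate the difference between returning a value and returning a match, here is a minimal sketch on a made-up string. `sample_line` is hypothetical and is not in the format of the provided log file.

```python
import re

# Hypothetical line, not taken from the provided log file.
sample_line = 'user=alice id=42'

# re.search returns a Match object (or None when nothing matches),
# so the text itself has to be pulled out of its groups.
match = re.search(r'id=(\d+)', sample_line)

whole_match = match.group(0)   # the full matched text: 'id=42'
user_id = int(match.group(1))  # the first capture group, converted to an integer

print(whole_match, user_id)  # id=42 42
```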

We start off with a random log line and write Python functions that use regular expressions to break it into pieces.

In [84]:
import re

l = '5.106.145.204 - - [04/Sep/2019:13:51:39 +0430] "POST /v1/crash-report/incident/report/ HTTP/1.1" 200 65 "-" "Dalvik/1.6.0 (Linux; U; Android 4.2.2; GT-S7272 Build/JDQ39)"'
print(l)

5.106.145.204 - - [04/Sep/2019:13:51:39 +0430] "POST /v1/crash-report/incident/report/ HTTP/1.1" 200 65 "-" "Dalvik/1.6.0 (Linux; U; Android 4.2.2; GT-S7272 Build/JDQ39)"


Make a function that extracts the IP address part of the log line using regular expressions. (5 points)

In [92]:
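def get_ip_address(l):
    # four groups of 1-3 digits separated by literal dots (an unescaped . matches any character)
    match = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", l)
    return match.group(0) if match else None # return the matched text, not the Match object


get_ip_address(l)  # Expected: '5.106.145.204'


'5.106.145.204'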

Make a function that extracts the operating system name and version using regular expressions. (5 points)

Your answer needs to be general enough for the whole log file, which contains Windows, Linux, and Android operating systems.

In [104]:
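def get_os(l):
    # OS name, a space, then the version: everything up to the next ';' or ')'
    match = re.search(r"(Windows|Linux|Android) [^;)]+", l)
    return match.group(0) if match else None # the matched text, or None if no OS is found
get_os(l) #Expected: 'Android 4.2.2'

'Android 4.2.2'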

Make a function that extracts the HTTP method, URL, response code, and response size and returns them as a tuple. Use regular expressions. The HTTP method is either `POST` or `GET` and the response code is always a 3-digit integer. (10 points)

In [116]:
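def get_http_info(l):
    # "METHOD url HTTP/version" code size, with one capture group per requested field
    match = re.search(r'"(POST|GET) (\S+) HTTP/[\d.]+" (\d{3}) (\d+)', l)
    if match is None:
        return None
    method, url, code, size = match.groups()
    return method, url, int(code), int(size)

get_http_info(l)  # Expected: ('POST', '/v1/crash-report/incident/report/', 200, 65)
# Please note that the last two numbers are converted to integers

('POST', '/v1/crash-report/incident/report/', 200, 65)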

Use regular expressions to break the date and time section apart and create a Python `datetime` object from it. Mind the time zone: convert the datetimes to MDT. Using `strptime` is a better solution in general, but for this assignment please stick to writing regular expressions so you become more comfortable writing and debugging them. (15 points)
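
The time zone conversion itself can be sketched independently of the regular expression. The example below types in the fields of the sample log line by hand (in the assignment they would come from regex groups) and assumes MDT is a fixed UTC-6 offset. The names `source_tz`, `mdt`, `local_time`, and `converted` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Offset taken from the sample log line (+0430); MDT is assumed to be UTC-6.
source_tz = timezone(timedelta(hours=4, minutes=30))
mdt = timezone(timedelta(hours=-6))

# Attach the original offset first, then convert the same instant to MDT.
local_time = datetime(2019, 9, 4, 13, 51, 39, tzinfo=source_tz)
converted = local_time.astimezone(mdt)

print(converted.isoformat())  # 2019-09-04T03:21:39-06:00
```

Building the `datetime` with its own `tzinfo` and then calling `astimezone` avoids doing the hour and minute arithmetic by hand.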


In [123]:
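from datetime import datetime, timedelta, timezone
from calendar import month_abbr

MDT = timezone(timedelta(minutes=-6*60 + 0))

def month_num(name):
    return {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec':12}[name]

def get_datetime(l):
    # [day/month/year:hour:minute:second +hhmm], e.g. [04/Sep/2019:13:51:39 +0430]
    match = re.search(r"\[(\d{2})/([A-Za-z]{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2}) ([+-])(\d{2})(\d{2})\]", l)
    day, month, year, hour, minute, second, sign, tz_hour, tz_minute = match.groups()
    offset = timedelta(hours=int(tz_hour), minutes=int(tz_minute))
    if sign == '-':
        offset = -offset
    logged = datetime(int(year), month_num(month), int(day), int(hour), int(minute), int(second),
                      tzinfo=timezone(offset)) # groups are strings, so convert them to int
    return logged.astimezone(MDT) # same instant, expressed in MDT

get_datetime(l).isoformat()  # Expected: '2019-09-04T03:21:39-06:00'


'2019-09-04T03:21:39-06:00'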

Read the log file line by line and use the `get_os` function above to find all unique operating systems (including version) and print the sorted results. (5 points)


In [None]:

unique_os = set() # a set keeps each operating system only once
with open("log.txt", 'r') as log_file:
    for line in log_file: # iterate over the file line by line
        os_name = get_os(line)
        if os_name is not None: # skip lines without a recognizable OS
            unique_os.add(os_name)

unique_sorted_os_list = sorted(unique_os)

print('\n'.join(unique_sorted_os_list))



Read the log file line by line and use the `get_datetime` and `get_http_info` functions above to calculate the used bandwidth of the server (the sum of all the response sizes) per hour. Use a `dict` or a `defaultdict`. (10 points)

For example if there are 4 logs like:

    Sep 4 14:20 .... 65bytes
    Sep 4 14:35 .... 80bytes
    Sep 4 15:01 .... 44bytes
    Sep 5 18:20 .... 40bytes

The result will be like:

    Sep 4 14:00  145
    Sep 4 15:00  44
    Sep 5 18:00  40
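
The example above can be reproduced with the sketch below, which groups by hour by truncating each timestamp to the start of its hour. The four entries are typed in by hand as `(timestamp, bytes)` pairs, the year 2019 is an assumption, and `sample_logs` and `hourly` are hypothetical names.

```python
from collections import defaultdict
from datetime import datetime

# The four example logs as (timestamp, bytes) pairs; the year is an assumption.
sample_logs = [
    (datetime(2019, 9, 4, 14, 20), 65),
    (datetime(2019, 9, 4, 14, 35), 80),
    (datetime(2019, 9, 4, 15, 1), 44),
    (datetime(2019, 9, 5, 18, 20), 40),
]

hourly = defaultdict(int)
for timestamp, size in sample_logs:
    # Truncate to the start of the hour so all events in that hour share one key.
    hourly[timestamp.replace(minute=0, second=0, microsecond=0)] += size

for hour in sorted(hourly):
    print(f'{hour:%b} {hour.day} {hour:%H:%M}  {hourly[hour]}')
```

Using the truncated `datetime` as the dictionary key keeps the date and the hour together, so the same hour on different days is never merged.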

In [125]:
from collections import defaultdict

bandwidth = defaultdict(int) # maps the start of each hour to the bytes sent in that hour

with open("log.txt", 'r') as log_file:
    for line in log_file:
        http_info = get_http_info(line)
        if http_info is None:
            continue # skip lines that do not match the expected format
        # truncate the event time to the start of its hour and use that as the key
        hour = get_datetime(line).replace(minute=0, second=0)
        bandwidth[hour] += http_info[3] # the response size in bytes

for hour in sorted(bandwidth):
    print(f'{hour:%b} {hour.day} {hour:%H:%M}  {bandwidth[hour]}')