# Lab 1 - Data Structures and Sorting


### Submission instructions
#### Turn in the lab via Canvas Assignments today (Thurs.) by 11:59pm.
After completing this lab, you will turn in two files via Canvas ->  Assignments -> Homework 1:
Your Notebook, named si330-hw1-YOUR_UNIQUE_NAME.ipynb and
the HTML file, named si330-hw1-YOUR_UNIQUE_NAME.html

### Name:  YOUR NAME GOES HERE
### Uniqname: YOUR UNIQNAME GOES HERE
### People you worked with: [if you didn't work with anyone else write "I worked by myself" here].

## Objectives
After completing this lab, you should know how to
* use compound data structures
* perform simple and complex sorting
* use lambda functions

In addition, this assignment will provide an opportunity to work with a large (100,000 row) data set.

## Background

Massive Open Online Courses (MOOCs) are a popular way for people to learn new skills.  The University of Michigan
offers many different MOOCs, which are produced by faculty members and supported by the Office of Academic 
Innovation.

MOOCs tend to be used by hundreds to hundreds of thousands of users.  These users leave "digital exhaust" when
they work through the MOOC in the form of web server log entries.  We have obtained a small sample of these data
files from Prof. Chris Brooks, who is a colleague here at UMSI.  The data files are de-identified: anything
that could identify a person, such as their UMID or their IP address, is "hashed" (replaced with a one-way scrambled value).  Each line in the
data file represents a "page view" by a user.  The schema for each line is:

```umich_user_id, hashed_session_cookie_id, server_timestamp, hashed_ip, user_agent, url, initial_referrer_url, browser_language, course_id, country_cd, region_cd, timezone, os, browser, key, value```

**For this lab we will only concern ourselves with ```umich_user_id```, which identifies each user, and ```country_cd```, which gives the two-letter code of the user's country.**

We will use the files **mooc_small.csv** and **countrycodes.tsv** for this lab.

In this lab, we will practice some basic manipulations of the MOOC log data. These concepts will be tested in your homework assignment, where you will use these manipulations to answer some real-world questions.

In [1]:
import csv
from collections import defaultdict

### Importing the data

The first step is to load the data. All of the data about MOOC usage and users is in the file ```mooc_small.csv```. The countries in that file are stored as country codes, so we also need ```countrycodes.tsv``` to translate each code into a country name.

Given a two-letter ```country_cd``` from ```mooc_small.csv```, we want to look up the full name of the country. Hence, we will import the country codes into a dictionary.

<font color="red">**Q: Why is a dictionary the best data structure for storing this?**</font>

Write your answer here...

In [2]:
### This chunk of code reads from a tab separated value file and stores the data in a dictionary
country_names = {}

with open("countrycodes.tsv", "r") as csvfile:
    reader = csv.DictReader(csvfile, delimiter = "\t", quotechar = '"')
    for row in reader:
        country_names[row['ISO ALPHA-2 Code']] = row['Country or Area Name']

Next we will import the data from ```mooc_small.csv``` into a list.

In [3]:
# Solution block: to be deleted before distribution
mooc_data_file_name = "mooc_small.csv"  # the data file used throughout this lab

mooc_data = []

with open(mooc_data_file_name, "r") as csvfile:
    reader = csv.DictReader(csvfile, delimiter = ",", quotechar = '"')
    for row in reader:
        mooc_data.append(row)

Read the **first 10 lines** from ```mooc_data```. We will output the ```user id```, ```country code``` and ```full name of the country```.
<font color="red">You will modify the next block of code to print these values.</font>

In [4]:
# Solution block: to be deleted before distribution
for row in mooc_data[:10]:
    print(row['umich_user_id'], row['country_cd'], country_names[row['country_cd']])

0ea5cc6ff0ca76782e6c0a81f070cae9cf0971d9 PF French Polynesia
0ea5cc6ff0ca76782e6c0a81f070cae9cf0971d9 PF French Polynesia
5450de1c9e1874d613a9649a39352a10313a3b8f IT Italy
0ea5cc6ff0ca76782e6c0a81f070cae9cf0971d9 PF French Polynesia
25424b1007637699cf0c672edc7a64c2b65268fa US United States of America
a95f04999ccf8fcd0f26fb0851745073a147e009 CZ Czech Republic
4ea0a18ab02a30290dda02bdb2da8a7a6a469245 US United States of America
25424b1007637699cf0c672edc7a64c2b65268fa US United States of America
44185055eece5d1bc7986d743d240a7633d968ff US United States of America
005aa91c779e6fe84b49398793dbda670dd6c352 NO Norway
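
Indexing ```country_names``` directly raises a ```KeyError``` if a log row carries a code that is missing from ```countrycodes.tsv```. The sketch below shows one defensive option, ```dict.get``` with a fallback string. The sample dictionary, the code ```"ZZ"```, and the helper name ```country_label``` are made up for illustration and are not part of the lab files.

```python
# Hypothetical stand-in for the country_names dictionary built above
sample_country_names = {"IT": "Italy", "NO": "Norway"}


def country_label(code, names):
    """Return the full country name, or a placeholder for unknown codes."""
    return names.get(code, "Unknown country")


for code in ["IT", "ZZ"]:
    print(code, country_label(code, sample_country_names))
```

Whether a placeholder is appropriate depends on the question being asked; for some analyses it may be better to let the error surface.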


### Manipulating the data

Next, we want to store the data in a data structure that will make it easier to perform operations on it later. We will create a dictionary of lists. 
<font color='red'>Using ```defaultdict```, store a list of log entries for each country. For now, we will store only the ```umich_user_id``` of each log entry.</font>

In [5]:
# Solution block: to be deleted before distribution
country_user_data = defaultdict(list)
for i in mooc_data:
    country_user_data[i['country_cd']].append(i['umich_user_id'])
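
To see what a dictionary of lists looks like, here is a minimal sketch on three made-up rows (the user ids ```u1``` and ```u2``` are hypothetical). With ```defaultdict(list)```, a missing key starts out as an empty list, so there is no need to check whether a country has been seen before appending to it.

```python
from collections import defaultdict

# Made-up rows with the same two fields used in this lab
sample_rows = [
    {"umich_user_id": "u1", "country_cd": "US"},
    {"umich_user_id": "u2", "country_cd": "IT"},
    {"umich_user_id": "u1", "country_cd": "US"},
]

sample_by_country = defaultdict(list)
for row in sample_rows:
    # The first access to a new country code creates an empty list
    sample_by_country[row["country_cd"]].append(row["umich_user_id"])

print(dict(sample_by_country))
```

Note that the list for ```US``` contains ```u1``` twice: one entry per log row, not per user.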

### Filtering the data

We want to find out the number of different users overall, and from the US. For that we will first need to filter ```country_user_data``` to retrieve data from the US, and store it in ```us_user_data```. Since we are using a dictionary, this is relatively straightforward.

In [6]:
us_user_data = country_user_data['US']
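
One caveat worth knowing: indexing a ```defaultdict``` with a key it has never seen silently creates that key. The sketch below, on made-up data, contrasts plain indexing with ```.get```, which leaves the dictionary unchanged. The country code ```"XX"``` is a hypothetical example of a code with no log entries.

```python
from collections import defaultdict

sample_by_country = defaultdict(list)
sample_by_country["US"].append("u1")

# .get does not trigger the default factory, so no new key is created
missing = sample_by_country.get("XX", [])
keys_after_get = sorted(sample_by_country)

# Plain indexing creates an empty list under "XX" as a side effect
also_missing = sample_by_country["XX"]
keys_after_index = sorted(sample_by_country)

print(keys_after_get)    # ['US']
print(keys_after_index)  # ['US', 'XX']
```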

### Operations on the data

We can count the unique users in the data in two different ways: with **dictionaries** or with **sets**.

Dictionaries are the more common choice. Their advantage is that each key corresponds to one unique user, while the value stored under that key can hold the number of log entries for that user.

<font color="red">In the following code block you will use **dictionaries** to count unique users and the number of logs for each. Write down the code using ```defaultdict```.</font>

In [7]:
# Solution block: to be deleted before distribution
### Using Dictionaries
unique_user_data_overall = defaultdict(int)
unique_user_data_us = defaultdict(int)
for i in mooc_data:
    unique_user_data_overall[i['umich_user_id']] += 1
    if i['country_cd'] == 'US':
        unique_user_data_us[i['umich_user_id']] += 1

print("Total no. of unique users: ", len(unique_user_data_overall))
print("No. of unique users from the US: ", len(unique_user_data_us))

Total no. of unique users:  51
No. of unique users from the US:  25
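
As an aside, the standard library's ```collections.Counter``` does the same tallying in one step. This is an alternative to the ```defaultdict(int)``` approach asked for above, not a replacement for it, and it is shown here on made-up user ids.

```python
from collections import Counter

# Made-up rows; u1 appears twice
sample_rows = [
    {"umich_user_id": "u1", "country_cd": "US"},
    {"umich_user_id": "u2", "country_cd": "IT"},
    {"umich_user_id": "u1", "country_cd": "US"},
    {"umich_user_id": "u3", "country_cd": "NO"},
]

# Counter maps each user id to the number of rows it appears in
logs_per_user = Counter(row["umich_user_id"] for row in sample_rows)

print(len(logs_per_user))   # number of unique users
print(logs_per_user["u1"])  # number of log entries for u1
```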


Another way of counting unique users is to use sets. In the following block you will create two sets to store the ids of users globally, and from the US. Unlike a list, a set stores each id only once.
<font color='red'>In the following chunk, we get a set of all the unique users globally. Write down the code to store the set of users from the US in a separate set, ```unique_us_users```.</font>

In [8]:
# Solution block: to be deleted before distribution
### Using Sets
unique_users = set()
unique_us_users = set()
for i in mooc_data:
    unique_users.add(i['umich_user_id'])
    if i['country_cd'] == 'US':
        unique_us_users.add(i['umich_user_id'])

print("Total no. of unique users: ", len(unique_users))
print("No. of unique users from the US: ", len(unique_us_users))

Total no. of unique users:  51
No. of unique users from the US:  25
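
Sets also support operations such as difference. For example, the users who never appear with a US country code can be found by subtracting one set from the other. The following is a sketch on made-up ids, using set comprehensions instead of an explicit loop.

```python
# Made-up rows with the same two fields used in this lab
sample_rows = [
    {"umich_user_id": "u1", "country_cd": "US"},
    {"umich_user_id": "u2", "country_cd": "IT"},
    {"umich_user_id": "u1", "country_cd": "US"},
    {"umich_user_id": "u3", "country_cd": "NO"},
]

# Set comprehensions discard duplicates automatically
all_users = {row["umich_user_id"] for row in sample_rows}
us_users = {
    row["umich_user_id"] for row in sample_rows if row["country_cd"] == "US"
}

# Users with no log entry from the US
non_us_users = all_users - us_users
print(sorted(non_us_users))  # ['u2', 'u3']
```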


#### Average number of logs globally, and for users from the US.
Here we will write the code to get the average number of logs per user. We can compute this from the dictionaries that we created previously, ```unique_user_data_overall``` and ```unique_user_data_us```.

In [9]:
# Solution block: to be deleted before distribution
global_mean_logs_per_user = sum(unique_user_data_overall.values())/len(unique_user_data_overall)
us_mean_logs_per_user = sum(unique_user_data_us.values())/len(unique_user_data_us)

print(global_mean_logs_per_user, us_mean_logs_per_user)

1.9607843137254901 1.76
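
The same calculation can be checked on a small made-up dictionary, with ```statistics.mean``` from the standard library as a cross-check. Because the counts add up to the total number of log rows, the mean is simply rows divided by unique users. The dictionary name and its values are hypothetical.

```python
from statistics import mean

# Made-up counts: user id -> number of log entries
sample_logs_per_user = {"u1": 3, "u2": 1, "u3": 2}

# Total log entries divided by the number of unique users
manual_mean = sum(sample_logs_per_user.values()) / len(sample_logs_per_user)

# Library cross-check on the same values
library_mean = mean(list(sample_logs_per_user.values()))

print(manual_mean, library_mean)
```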


### Sorting
The goal is to get the ```umich_user_id``` of the users who visited the most pages, globally and from the US. We will use the built-in ```sorted``` function and pass a ```lambda``` function through its ```key``` parameter. The solution below shows the top 5 users from the US.

**<font color="red">Write down the code for the lambda function.</font>**

In [10]:
# Solution block: to be deleted before distribution
sorted_us_data = sorted(unique_user_data_us.items(), key = lambda x: x[1], reverse = True)

# Slicing avoids an IndexError if there are fewer than 5 users
for entry in sorted_us_data[:5]:
    print(entry)

('c7e0b7e873392815abee61a53c231a1d5866a659', 6)
('1066a697903937dcb2bba46698b65c9067602b13', 4)
('44185055eece5d1bc7986d743d240a7633d968ff', 3)
('95cffc5948af183853735930299a0bc48c1cdc6c', 3)
('70d530b2e677aa82a680b36ba534dbabc884e010', 3)
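
In the output above, three users are tied at 3 log entries. ```sorted``` is stable, so tied items keep their original relative order. If a deterministic tie-break is wanted, the lambda can return a tuple. The sketch below, on made-up counts, sorts by count descending and then by user id ascending; negating the count is one common way to mix sort directions in a single key.

```python
# Made-up counts: user id -> number of log entries
sample_counts = {"u3": 3, "u1": 3, "u2": 6, "u4": 1}

# Primary key: count, descending (hence the minus sign)
# Secondary key: user id, ascending, to break ties
ranked = sorted(sample_counts.items(), key=lambda item: (-item[1], item[0]))

for user_id, count in ranked[:3]:
    print(user_id, count)
```

The negation trick only works for numeric keys; for other types, sorting twice (secondary key first, then primary) relies on the same stability property.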
