The rest of the notebook contains specific steps that you need to follow and complete.  Places where you need to 
do something are indicated in <font color="magenta">magenta</font>.

First, let's load up the ```csv``` library; we're going to need it to read the comma- and tab-separated 
values files.

In [3]:
import csv
from collections import defaultdict

#### Step 1. Import the data

You'll load the data from the two files ```mooc_big.csv``` and ```countrycodes.tsv``` into two separate 
data structures. 

Let's start with ```countrycodes.tsv```. Remember, we're going to use that file to map from the 
2-letter country code to the country name (e.g. from "CA" to "Canada").




 <font color="magenta">Modify the next block of code so that it loads ```countrycodes.tsv``` into a data structure
    that allows you to look up the country name that corresponds to the 2-letter country code.</font>

In [4]:
country_names = {}

with open("countrycodes.tsv", "r", newline="") as csvfile:
    reader = csv.DictReader(csvfile, delimiter="\t", quotechar='"')
    for row in reader:
        # map the 2-letter code to the country name
        country_names[row['ISO ALPHA-2 Code']] = row['Country or Area Name']
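
To see how ```csv.DictReader``` handles tab-separated input without touching the real file, here is a minimal sketch that parses a tiny in-memory sample. The two sample rows are an assumption made for illustration (only the column headers are taken from the cell above), and ```sample_tsv``` and ```sample_names``` are hypothetical names.

```python
import csv
import io

# Hypothetical two-row sample; the real countrycodes.tsv may contain more columns.
sample_tsv = "ISO ALPHA-2 Code\tCountry or Area Name\nCA\tCanada\nDE\tGermany\n"

sample_names = {}
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t", quotechar='"')
for row in reader:
    # key: 2-letter country code, value: country name
    sample_names[row["ISO ALPHA-2 Code"]] = row["Country or Area Name"]

print(sample_names["CA"])  # Canada
```

Because the codes are dictionary keys, each lookup is a single indexing operation rather than a scan through the file.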




Now load the MOOC log data into an appropriate data structure (start with the ```mooc_small.csv``` file, then remember to change to ```mooc_big.csv```). For this file, store all the rows in the data structure.


 <font color="magenta">Modify the next block of code so that it loads the MOOC log data into a data structure 
   that will allow you to answer the three real-world questions.</font>

In [10]:
mooc_data_file_name = "mooc.csv"

mooc_data = [] #data structure

with open(mooc_data_file_name, "r") as csvfile:
    reader = csv.DictReader(csvfile, delimiter = ",", quotechar = '"')
    for row in reader:
        mooc_data.append(row) # Change this line to populate the data structure you created above with data from the file


bb116a9af763fa2f53139c1dfa851c760a667169 DE Germany


#### Step 2. Manipulating and interpreting the data to answer our questions

Now that we have our data loaded, we can start to answer the real-world questions.

Recall that the first question
is <b>"How many different countries (based on COUNTRY_CD) are represented in the data file?"</b>

To do this, you're going to have to figure out how many unique country codes there are in the MOOC log file. There are a few different ways to do this, but you probably want to use either a ```set``` or a ```dict```.
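
The two options can be compared on a toy input. This is only a sketch: ```sample_rows``` is a hypothetical stand-in for the rows read from the log file, and the other variable names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows shaped like the dictionaries produced by csv.DictReader
sample_rows = [{"country_cd": "US"}, {"country_cd": "CA"}, {"country_cd": "US"}]

# Option 1: a set keeps only the distinct codes
unique_codes = {row["country_cd"] for row in sample_rows}

# Option 2: a dict also records how often each code appears
code_counts = defaultdict(int)
for row in sample_rows:
    code_counts[row["country_cd"]] += 1

print(len(unique_codes), len(code_counts))  # 2 2
```

Both give the same number of distinct codes; the dict costs a little more bookkeeping but keeps the per-country counts, which can be reused later.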

<font color="magenta">Modify the following code block so that the print statement at the end prints
    the number of countries represented in the MOOC log file.</font>

In [23]:
# figure out how many unique country codes there are in the MOOC log file

countries = defaultdict(int) # data structure

for row in mooc_data:
    code = row['country_cd']
    countries[code]+=1
    
print(countries)
print(len(countries))


defaultdict(<class 'int'>, {'PF': 5, 'IT': 2, 'US': 44, 'CZ': 3, 'NO': 4, 'DE': 8, 'EG': 1, 'BY': 5, 'AU': 5, 'PS': 2, 'PL': 1, 'GD': 2, 'CA': 10, 'BR': 1, 'CN': 1, 'JM': 2, 'JP': 2, 'LB': 1, 'IQ': 1})
19


Next you want to find out the <b>top 5 countries with the most page views</b>. For this, you should implement a composite data structure that stores, for each country, the details of each log entry: ```UMICH User ID``` and ```hashed_session_cookie_id```. There are different ways to do this. One way is to use a ```dictionary of lists```. Think about how you would populate each list.

After that, you will sort the data structure using ```sorted```. You will need to pass a ```lambda``` function as the ```key``` parameter of ```sorted```. The ```lambda``` specifies the value that each item in the data structure is sorted by.
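
As a rough illustration of sorting a dictionary of lists by list length, consider the sketch below. The data in ```sample_logs``` is invented for the example and the variable names are hypothetical.

```python
# Hypothetical dictionary of lists: country code -> list of (user_id, session_id) tuples
sample_logs = {
    "CA": [("u1", "s1")],
    "US": [("u2", "s2"), ("u2", "s3"), ("u3", "s4")],
    "DE": [("u4", "s5"), ("u5", "s6")],
}

# Each item is a (code, list) pair, so x[1] is the list and len(x[1]) its number of entries
ranked = sorted(sample_logs.items(), key=lambda x: len(x[1]), reverse=True)

for code, entries in ranked:
    print(code, len(entries))
```

Note that ```sorted``` returns a new list of ```(key, value)``` tuples and leaves the original dictionary unchanged.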

<font color="magenta">Modify the following code block so that the print statement at the end prints
    the top 5 countries represented in the MOOC log file, and the corresponding number of users.</font>

In [75]:



country_user_data = defaultdict(list)  # country code -> list of (user ID, session cookie ID) tuples

for row in mooc_data:
    # for each country, store the UMICH User ID and hashed_session_cookie_id of every log entry
    user_id = row['umich_user_id']
    cookie_id = row['hashed_session_cookie_id']
    code = row['country_cd']

    info = (user_id, cookie_id)

    country_user_data[code].append(info)  # populate the data structure
    

# print(country_user_data)

# Sort 'country_user_data' by the number of entries stored for each country, using a lambda as the key.
sorted_country_user_data = sorted(country_user_data.items(), key=lambda x: len(x[1]), reverse=True)

# print(sorted_country_user_data)

# Do not change the following lines of code. 
# This should output the top 5 countries, along with the number of users from each of those countries.
for i in range(5):
    print(country_names[sorted_country_user_data[i][0]], len(sorted_country_user_data[i][1]))

United States of America 44
Canada 10
Germany 8
French Polynesia 5
Belarus 5


From this step on, you will be working with the ```country_user_data``` data structure.

Here, you will need to <b>filter the data so you only have entries from the US (i.e. where COUNTRY_CD is US)</b>. You then need to count the log entries in each session, i.e. the entries that share the same ```hashed_session_cookie_id```.

From the ```country_user_data``` data structure, retrieve the entries from the US. Using a ```defaultdict```, count the number of log entries (rows) per session (```hashed_session_cookie_id```) in a new data structure. The number of rows gives you the number of pages the user viewed in that session.
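
The counting pattern can be sketched on a few made-up entries. ```sample_entries``` and ```views_per_session``` are hypothetical names, and the tuples are assumed to have the same ```(user ID, session ID)``` shape as the entries stored per country.

```python
from collections import defaultdict

# Hypothetical (user_id, session_id) tuples for one country
sample_entries = [("u1", "s1"), ("u1", "s1"), ("u2", "s2"), ("u1", "s1")]

views_per_session = defaultdict(int)
for user_id, session_id in sample_entries:
    # a missing key starts at 0, so no existence check is needed
    views_per_session[session_id] += 1

print(dict(views_per_session))  # {'s1': 3, 's2': 1}
```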

<font color="magenta">Modify the following code block so the data structure us_data contains only the entries for people from the US.</font>

In [88]:
us_data = defaultdict(int)  # session cookie ID -> number of log entries
for row in country_user_data['US']:
    us_data[row[1]] += 1  # row[1] is the hashed_session_cookie_id; count the log entries per session
   

Now, you need to calculate the <b>average number of pageviews per session</b> for users in the US. ```numpy```, which will be covered later, has a built-in function for this. For now, you will iterate over the values, sum them up, and divide by the number of values. Recall the ```sum``` and ```len``` functions in Python.
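
A minimal sketch of the same calculation with ```sum``` and ```len```, using invented per-session counts (```sample_counts``` is a hypothetical name) and guarding against an empty dictionary:

```python
# Hypothetical mapping: session ID -> number of page views
sample_counts = {"s1": 3, "s2": 1, "s3": 2}

if sample_counts:
    # total page views divided by the number of sessions
    average = sum(sample_counts.values()) / len(sample_counts)
else:
    # avoid a ZeroDivisionError when there are no sessions
    average = 0.0

print(average)  # 2.0
```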

<font color="magenta">In the following block of code put in the formula for calculating the average.</font>

In [80]:

counter = 0
for item in us_data:
    counter = counter + us_data[item]

avg_page_views_per_session = counter / len(us_data)  # average number of page views per session for MOOC users from the US
print(avg_page_views_per_session)

1.76


Finally, you want to <b>sort the sessions to retrieve the ones that have the most logs</b>. Call the ```sorted``` function, pass the appropriate ```lambda``` function to the ```key``` parameter, and store the result in the data structure ```sorted_us_data```.
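
One way to take the top entries is to slice the sorted list, which also works when there are fewer than five sessions. This is a sketch on invented counts; ```session_counts``` and ```top_sessions``` are hypothetical names.

```python
# Hypothetical mapping: session ID -> number of log entries
session_counts = {"s1": 3, "s2": 1, "s3": 6}

# Sort the (session_id, count) pairs by count, largest first, and keep at most five
top_sessions = sorted(session_counts.items(), key=lambda x: x[1], reverse=True)[:5]

for session_id, n_logs in top_sessions:
    print(session_id, n_logs)
```

Slicing with ```[:5]``` simply returns a shorter list when fewer items exist, whereas indexing inside ```range(5)``` would raise an ```IndexError``` in that case.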

<font color="magenta">In the following block, write down the code for the sorted function. The print statement should output the top 5 hashed_session_cookie_id values and the corresponding number of logs for each session.</font>

In [87]:
sorted_us_data = sorted(us_data.items(), key=lambda x: x[1], reverse=True)  # sort sessions by number of log entries, largest first

for i in range(5):
    print(sorted_us_data[i]) #This will print out the top 5 sessions with their hashed_session_cookie_id and no. of log entries

('d8fe83dbeba4af9b001d3ad8f8aa8940b40e06ce', 6)
('9431b24e18b18ea6b5aea81920abd33fb9ce55ee', 4)
('c13cb2cdb6e7ebbc4e1e434a29e449e221f3c5d3', 3)
('e0f1598cc697187a9ab35f12562f7ad7ce2dcc2a', 3)
('85bf1f93b06602d828147c5b2ffabb066e63c4b1', 3)


In [98]:
#for each of the top 5 countries by number of log entries, count the log entries per session

sorted_mooc_countries = sorted(countries.items(), key=lambda x: x[1], reverse=True)
for i in range(5):
    print(sorted_mooc_countries[i])
    country = sorted_mooc_countries[i][0]
    country_dict = defaultdict(int)  # session cookie ID -> number of log entries for this country
    for row in country_user_data[country]:
        country_dict[row[1]] += 1
    print(country_dict)
