# Lab 7: MapReduce

Welcome to lab 7! This exercise is another free-form challenge, just like lab 4, which focused on the London 2012 athletes dataset. This time, I want you to see if you can answer some questions about a dataset using only the MapReduce programming model.

<img src="mapreduce.jpg"/>

First, run the cell below to view a sample of 10 rows from the text file `nasa_access_log_aug95_sample.txt`. 

In [5]:
# The function islice takes the iterable of lines returned by the
# open( ... ) command and returns a slice of only the first 10 of those lines.

from itertools import islice
with open('nasa_access_log_aug95_sample.txt') as file_pointer:
    for line in list(islice(file_pointer, 10)):
        print(line)

159.142.165.138 - - [15/Aug/1995:11:03:22 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179

134.131.38.18 - - [22/Aug/1995:13:43:38 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179

os2c14.aca.ilstu.edu - - [31/Aug/1995:21:47:11 -0400] "GET /shuttle/missions/sts-69/sts-69-patch-small.gif HTTP/1.0" 200 8083

suba01.suba.com - - [24/Aug/1995:04:48:23 -0400] "GET /htbin/wais.pl?TISP HTTP/1.0" 200 1349

146.138.145.170 - - [08/Aug/1995:16:30:51 -0400] "GET /shuttle/missions/sts-62/sts-62-patch-small.gif HTTP/1.0" 200 14385

pizza.innet.net - - [24/Aug/1995:18:22:52 -0400] "GET /history/apollo/images/apollo-logo1.gif HTTP/1.0" 200 1173

uplherc.upl.com - - [01/Aug/1995:00:00:10 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 304 0

205.129.171.133 - - [16/Aug/1995:14:13:00 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713

icenet.blackice.com.au - - [16/Aug/1995:07:52:55 -0400] "GET /history/apollo/images/apollo.gif HTTP/1.0

## The Challenge (part A)

Unlike the previous exercises, I have not provided you with a CSV file. Instead, this file contains lines of text in the format output by the Apache HTTP Server, one of the most popular web servers on the Internet. The lines are in a standardised format (see the [Common Log Format](https://en.wikipedia.org/wiki/Common_Log_Format) for details), but they are not comma-separated.
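
As an illustration of the format described above, here is a minimal sketch that splits one log line into named fields with a regular expression. The pattern, the group names and the helper `parse_clf_line` are my own assumptions for this example, not part of the lab, and the pattern only covers lines shaped like the sample shown earlier.

```python
import re

# Pattern for one Common Log Format line: host, identity, user, [timestamp],
# "request", status, size. The group names are illustrative labels only.
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)$'
)

def parse_clf_line(line):
    """Return a dict of fields, or None if the line does not match the pattern."""
    match = CLF_PATTERN.match(line.strip())
    return match.groupdict() if match else None

sample = ('159.142.165.138 - - [15/Aug/1995:11:03:22 -0400] '
          '"GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179')
fields = parse_clf_line(sample)
print(fields['host'], fields['status'], fields['size'])
```

Lines that do not match (for example, a truncated request) come back as `None`, so they can be counted or skipped explicitly rather than silently mis-split.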

The first part of the challenge is to create a CSV file from this log file. As an example, I have written a few lines below that strip out ` - - `, quotes and square brackets and then split each line on whitespace. You can use what you have already learned about string replacement to create what you think is a sensible set of columns, separated by commas.

In [18]:
from itertools import islice
with open('nasa_access_log_aug95_sample.txt') as file_pointer:
    for line in list(islice(file_pointer, 10)):
        # The following line simply takes the line read and does a string replacement
        print(line.replace(' - - ', ' ').replace(',',' ').replace('"','').replace('[',' ').replace(']',' ').split())

['159.142.165.138', '15/Aug/1995:11:03:22', '-0400', 'GET', '/shuttle/missions/sts-73/sts-73-patch-small.gif', 'HTTP/1.0', '200', '4179']
['134.131.38.18', '22/Aug/1995:13:43:38', '-0400', 'GET', '/shuttle/missions/sts-73/sts-73-patch-small.gif', 'HTTP/1.0', '200', '4179']
['os2c14.aca.ilstu.edu', '31/Aug/1995:21:47:11', '-0400', 'GET', '/shuttle/missions/sts-69/sts-69-patch-small.gif', 'HTTP/1.0', '200', '8083']
['suba01.suba.com', '24/Aug/1995:04:48:23', '-0400', 'GET', '/htbin/wais.pl?TISP', 'HTTP/1.0', '200', '1349']
['146.138.145.170', '08/Aug/1995:16:30:51', '-0400', 'GET', '/shuttle/missions/sts-62/sts-62-patch-small.gif', 'HTTP/1.0', '200', '14385']
['pizza.innet.net', '24/Aug/1995:18:22:52', '-0400', 'GET', '/history/apollo/images/apollo-logo1.gif', 'HTTP/1.0', '200', '1173']
['uplherc.upl.com', '01/Aug/1995:00:00:10', '-0400', 'GET', '/images/WORLD-logosmall.gif', 'HTTP/1.0', '304', '0']
['205.129.171.133', '16/Aug/1995:14:13:00', '-0400', 'GET', '/images/launch-logo.gif', 'H

Modify the code below to write out the `nasa_access_log_aug95_sample.csv` file with your string replacements to turn the input into a CSV file that can be read using `pandas`.

In [19]:
list_df=[]
with open('nasa_access_log_aug95_sample.txt') as input_file_pointer:
    with open('nasa_access_log_aug95_sample.csv', 'w') as output_file_pointer:
        for line in input_file_pointer:
            # Strip the log punctuation (and any stray commas), then split on whitespace
            data=line.replace(' - - ', ' ').replace(',',' ').replace('"','').replace('[',' ').replace(']','').split()
            # Keep only well-formed rows with exactly 8 fields
            if len(data)==8:
                # Write one comma-separated row per line so pandas can parse it
                output_file_pointer.write(','.join(data)+'\n')
                list_df.append(data)
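
As an alternative to joining fields by hand, here is a minimal sketch that writes rows with Python's standard `csv` module, which takes care of quoting and line endings. The two sample rows and the in-memory buffer are assumptions made so the example is self-contained; in the lab you would write to the `.csv` file instead.

```python
import csv
import io

# Two sample rows already split into the eight fields used in this lab
rows = [
    ['159.142.165.138', '15/Aug/1995:11:03:22', '-0400', 'GET',
     '/shuttle/missions/sts-73/sts-73-patch-small.gif', 'HTTP/1.0', '200', '4179'],
    ['uplherc.upl.com', '01/Aug/1995:00:00:10', '-0400', 'GET',
     '/images/WORLD-logosmall.gif', 'HTTP/1.0', '304', '0'],
]
header = ['Address', 'DateTime', 'Timezone', 'Method', 'File', 'HTTP', 'Status', 'Size']

# An in-memory buffer stands in for the output file so the sketch is self-contained
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(header)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the same rows
read_back = list(csv.reader(io.StringIO(csv_text)))
```

Writing a header row first means the column names travel with the file, so they do not have to be assigned again after loading.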

### For an unknown reason `read_csv` would not load the DataFrame, so the cell above also builds `list_df`, a list of lists of values, straight from the txt file, and the cells below create the DataFrame `df1` from `list_df`

In [8]:
import pandas as pd
import numpy as np
#df = pd.read_csv('nasa_access_log_aug95_short.csv', error_bad_lines=False, warn_bad_lines=False)
#df.columns = ['Address', 'DateTime','Timezone','Method','File','HTTP','Status','Size']
#df

In [20]:
df1=pd.DataFrame(list_df,columns=['Address', 'DateTime','Timezone','Method','File','HTTP','Status','Size'])

In [21]:
df1.tail()

Unnamed: 0,Address,DateTime,Timezone,Method,File,HTTP,Status,Size
98334,gordon-nsc2.army.mil,07/Aug/1995:14:52:43,-400,GET,/shuttle/missions/sts-69/mission-sts-69.html,HTTP/1.0,200,11264
98335,novix.casi.sti.nasa.gov,23/Aug/1995:18:07:10,-400,GET,/history/apollo/apollo-4/apollo-4.html,HTTP/1.0,200,3542
98336,cerruti.idx.com,08/Aug/1995:14:19:28,-400,GET,/images/launch-logo.gif,HTTP/1.0,200,1713
98337,arcadia.ece.miami.edu,28/Aug/1995:15:41:58,-400,GET,/images/WORLD-logosmall.gif,HTTP/1.0,200,669
98338,204.149.228.71,18/Aug/1995:06:08:54,-400,GET,/shuttle/missions/sts-69/mission-sts-69.html,HTTP/1.0,200,11996


## Question 1 - Which files were most popular in terms of GET requests?
### Mapper function

In [193]:
# Emit one (file, method) pair per request
Files_Map=list(zip(df1['File'],df1['Method']))
Files_Map

[('/shuttle/missions/sts-73/sts-73-patch-small.gif', 'GET'),
 ('/shuttle/missions/sts-73/sts-73-patch-small.gif', 'GET'),
 ('/shuttle/missions/sts-69/sts-69-patch-small.gif', 'GET'),
 ('/htbin/wais.pl?TISP', 'GET'),
 ('/shuttle/missions/sts-62/sts-62-patch-small.gif', 'GET'),
 ('/history/apollo/images/apollo-logo1.gif', 'GET'),
 ('/images/WORLD-logosmall.gif', 'GET'),
 ('/images/launch-logo.gif', 'GET'),
 ('/history/apollo/images/apollo.gif', 'GET'),
 ('/history/mercury/mr-3/mr-3-patch-small.gif', 'GET'),
 ('/shuttle/missions/sts-70/movies/woodpecker.mpg', 'GET'),
 ('/history/apollo/apollo-11/images/69HC635.GIF', 'GET'),
 ('/shuttle/missions/sts-69/mission-sts-69.html', 'GET'),
 ('/images/WORLD-logosmall.gif', 'GET'),
 ('/shuttle/missions/sts-70/images/DSC-95EC-0001.gif', 'GET'),
 ('/shuttle/missions/sts-71/mission-sts-71.html', 'GET'),
 ('/history/gemini/gemini.html', 'GET'),
 ('/images/WORLD-logosmall.gif', 'GET'),
 ('/history/apollo/apollo-11/apollo-11.html', 'GET'),
 ('/images/WORL

### Reducer function

In [194]:
def Reducer_1(Map):
    # Count how many GET requests were made for each file
    Result={}
    for file_name,method in Map:
        if method=='GET':
            Result[file_name]=Result.get(file_name,0)+1
    return Result

In [195]:
Q1Result=Reducer_1(Files_Map)
Q1Result

{'/shuttle/missions/sts-73/sts-73-patch-small.gif': 161,
 '/shuttle/missions/sts-69/sts-69-patch-small.gif': 1509,
 '/htbin/wais.pl?TISP': 22,
 '/shuttle/missions/sts-62/sts-62-patch-small.gif': 40,
 '/history/apollo/images/apollo-logo1.gif': 2378,
 '/images/WORLD-logosmall.gif': 4212,
 '/images/launch-logo.gif': 2194,
 '/history/apollo/images/apollo.gif': 65,
 '/history/mercury/mr-3/mr-3-patch-small.gif': 78,
 '/shuttle/missions/sts-70/movies/woodpecker.mpg': 70,
 '/history/apollo/apollo-11/images/69HC635.GIF': 10,
 '/shuttle/missions/sts-69/mission-sts-69.html': 1578,
 '/shuttle/missions/sts-70/images/DSC-95EC-0001.gif': 32,
 '/shuttle/missions/sts-71/mission-sts-71.html': 210,
 '/history/gemini/gemini.html': 95,
 '/history/apollo/apollo-11/apollo-11.html': 213,
 '/shuttle/missions/sts-71/movies/sts-71-launch.mpg': 140,
 '/icons/image.xbm': 624,
 '/ksc.html': 2782,
 '/images/KSC-logosmall.gif': 4917,
 '/images/NASA-logosmall.gif': 6193,
 '/shuttle/missions/sts-71/sts-71-patch-small.g

### Representation of Result sorted by value

In [196]:
import operator
sorted_Result=sorted(Q1Result.items(),key=operator.itemgetter(1),reverse=True)
sorted_Result[:10]

[('/images/NASA-logosmall.gif', 6193),
 ('/images/KSC-logosmall.gif', 4917),
 ('/images/MOSAIC-logosmall.gif', 4216),
 ('/images/WORLD-logosmall.gif', 4212),
 ('/images/USA-logosmall.gif', 4210),
 ('/images/ksclogo-medium.gif', 3942),
 ('/ksc.html', 2782),
 ('/history/apollo/images/apollo-logo1.gif', 2378),
 ('/images/launch-logo.gif', 2194),
 ('/images/ksclogosmall.gif', 1846)]
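
The mapper and reducer above fold grouping and counting into one function. As a rough sketch of the more traditional three-stage layout (map, shuffle, reduce), the example below runs on a handful of made-up `(file, method)` pairs in the same shape as `Files_Map`. The function names `map_phase`, `shuffle_phase` and `reduce_phase` are my own and not part of the lab.

```python
from collections import defaultdict

# A handful of (file, method) pairs in the same shape as Files_Map
sample_records = [
    ('/images/NASA-logosmall.gif', 'GET'),
    ('/ksc.html', 'GET'),
    ('/images/NASA-logosmall.gif', 'GET'),
    ('/ksc.html', 'HEAD'),
    ('/images/NASA-logosmall.gif', 'GET'),
]

def map_phase(records):
    """Emit (file, 1) for every GET request."""
    for file_name, method in records:
        if method == 'GET':
            yield (file_name, 1)

def shuffle_phase(pairs):
    """Group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle_phase(map_phase(sample_records)))
print(counts)
```

Here the `GET` filter sits in the map phase rather than in the reducer, which is a design choice: it keeps the reducer generic enough to reuse for any "count per key" question.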

## Question 2 - What day were the most HTTP requests made to the server?
### Mapper

In [197]:
Days_Map=[]
for x in df1['DateTime'].str[0:11]:
    Days_Map.append((x,1))
Days_Map

[('15/Aug/1995', 1),
 ('22/Aug/1995', 1),
 ('31/Aug/1995', 1),
 ('24/Aug/1995', 1),
 ('08/Aug/1995', 1),
 ('24/Aug/1995', 1),
 ('01/Aug/1995', 1),
 ('16/Aug/1995', 1),
 ('16/Aug/1995', 1),
 ('23/Aug/1995', 1),
 ('05/Aug/1995', 1),
 ('05/Aug/1995', 1),
 ('24/Aug/1995', 1),
 ('31/Aug/1995', 1),
 ('27/Aug/1995', 1),
 ('11/Aug/1995', 1),
 ('21/Aug/1995', 1),
 ('15/Aug/1995', 1),
 ('16/Aug/1995', 1),
 ('19/Aug/1995', 1),
 ('20/Aug/1995', 1),
 ('22/Aug/1995', 1),
 ('04/Aug/1995', 1),
 ('30/Aug/1995', 1),
 ('22/Aug/1995', 1),
 ('09/Aug/1995', 1),
 ('30/Aug/1995', 1),
 ('29/Aug/1995', 1),
 ('29/Aug/1995', 1),
 ('01/Aug/1995', 1),
 ('22/Aug/1995', 1),
 ('31/Aug/1995', 1),
 ('14/Aug/1995', 1),
 ('01/Aug/1995', 1),
 ('18/Aug/1995', 1),
 ('31/Aug/1995', 1),
 ('29/Aug/1995', 1),
 ('26/Aug/1995', 1),
 ('30/Aug/1995', 1),
 ('29/Aug/1995', 1),
 ('06/Aug/1995', 1),
 ('22/Aug/1995', 1),
 ('20/Aug/1995', 1),
 ('08/Aug/1995', 1),
 ('29/Aug/1995', 1),
 ('11/Aug/1995', 1),
 ('24/Aug/1995', 1),
 ('09/Aug/199

### Reducer function

In [198]:
def Reducer_2(Map):
    # Sum the emitted values for each key
    Result={}
    for key,value in Map:
        Result[key]=Result.get(key,0)+value
    return Result

In [199]:
Q2Result=Reducer_2(Days_Map)
Q2Result

{'01/Aug/1995': 2144,
 '03/Aug/1995': 2575,
 '04/Aug/1995': 3762,
 '05/Aug/1995': 1944,
 '06/Aug/1995': 2094,
 '07/Aug/1995': 3645,
 '08/Aug/1995': 3795,
 '09/Aug/1995': 3734,
 '10/Aug/1995': 3908,
 '11/Aug/1995': 3846,
 '12/Aug/1995': 2364,
 '13/Aug/1995': 2334,
 '14/Aug/1995': 3891,
 '15/Aug/1995': 3744,
 '16/Aug/1995': 3646,
 '17/Aug/1995': 3827,
 '18/Aug/1995': 3600,
 '19/Aug/1995': 2077,
 '20/Aug/1995': 2027,
 '21/Aug/1995': 3502,
 '22/Aug/1995': 3595,
 '23/Aug/1995': 3757,
 '24/Aug/1995': 3334,
 '25/Aug/1995': 3608,
 '26/Aug/1995': 2079,
 '27/Aug/1995': 2109,
 '28/Aug/1995': 3573,
 '29/Aug/1995': 4430,
 '30/Aug/1995': 5106,
 '31/Aug/1995': 5694}

In [200]:
import operator
sorted_Result=sorted(Q2Result.items(),key=operator.itemgetter(1),reverse=True)
sorted_Result[:10]

[('31/Aug/1995', 5694),
 ('30/Aug/1995', 5106),
 ('29/Aug/1995', 4430),
 ('10/Aug/1995', 3908),
 ('14/Aug/1995', 3891),
 ('11/Aug/1995', 3846),
 ('17/Aug/1995', 3827),
 ('08/Aug/1995', 3795),
 ('04/Aug/1995', 3762),
 ('23/Aug/1995', 3757)]
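
For comparison, the same "count per key, then take the largest" step can be sketched with `collections.Counter` from the standard library. The sample pairs below are made up in the shape of `Days_Map`; this is an alternative to the hand-written reducer, not the method the lab requires.

```python
from collections import Counter

# A few (day, 1) pairs in the same shape as Days_Map
sample_days_map = [
    ('15/Aug/1995', 1), ('22/Aug/1995', 1), ('31/Aug/1995', 1),
    ('31/Aug/1995', 1), ('15/Aug/1995', 1), ('31/Aug/1995', 1),
]

day_counts = Counter()
for day, value in sample_days_map:
    day_counts[day] += value  # add the emitted value rather than a fixed 1

# most_common(1) returns a list holding the single (key, count) pair with the highest count
busiest_day, requests = day_counts.most_common(1)[0]
print(busiest_day, requests)
```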

## Q3 - How many HTTP 200 responses were made?
### Map function

In [201]:
Response_Map=[]
for x in df1['Status']:
    Response_Map.append((x,1))
Response_Map

[('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('304', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('304', 1),
 ('304', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('304', 1),
 ('304', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('304', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('304', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('302', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('304', 1),
 ('200', 1),
 ('200', 1),
 ('304', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('304', 1),
 ('200', 1),

### Reducer Function

In [203]:
def Reducer_3(Map):
    Result={'200':0}
    for key in Map:
        if key[0]=='200':
            Result[key[0]]+=1
    return Result

In [204]:
Q3Result=Reducer_3(Response_Map)
Q3Result

{'200': 88851}

## Q4 - How many other HTTP code responses were made?
### Map function


In [205]:
# Using the same map as in Q3
Response_Map

[('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('304', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('304', 1),
 ('304', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('304', 1),
 ('304', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('304', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('304', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('302', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('304', 1),
 ('200', 1),
 ('200', 1),
 ('304', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('200', 1),
 ('304', 1),
 ('200', 1),

### Reducer Function - using the same function as in Q2

In [207]:
Q4Response=Reducer_2(Response_Map)
Q4Response

{'200': 88851, '302': 1639, '304': 8596, '403': 12, '404': 644, '501': 2}
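
To turn the per-code counts into a single "other responses" figure, one option is a second reduce step over everything except `200`. The sketch below copies the counts from the output above and folds them with `functools.reduce`; the variable names are my own.

```python
from functools import reduce

# Per-status counts copied from the Q4 output above
status_counts = {'200': 88851, '302': 1639, '304': 8596, '403': 12, '404': 644, '501': 2}

# Keep every status except 200, then fold the counts into one total
other_counts = {code: n for code, n in status_counts.items() if code != '200'}
other_total = reduce(lambda total, n: total + n, other_counts.values(), 0)

print(other_counts)
print(other_total)
```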

## Q5 - What were the biggest, smallest and average file sizes served?
### Map function

In [208]:
Size_Map=df1['Size'].tolist()

### Reduce function

In [209]:
Q5Result=sorted(Size_Map)

In [239]:
# Max value
maximum=0
for value in Q5Result:
    if value[0] in '0123456789':
        if int(value)>=maximum:
            maximum=int(value)
print(maximum)

3155499


In [211]:
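# Min value (compare as integers, because the list is sorted as strings)
minimum=None
for value in Q5Result:
    if value[0] in '0123456789':
        if minimum is None or int(value)<minimum:
            minimum=int(value)
print(minimum)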

0


In [212]:
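def Reducer_5(Map):
    # Average of the numeric sizes; entries that are not numbers (such as '-') are skipped
    # Named Reducer_5 so that it does not overwrite the Reducer_3 defined for Q3
    sumMap=0
    counter=0
    for value in Map:
        if value[0] in '0123456789':
            sumMap+=int(value)
            counter+=1
    return sumMap/counter

In [213]:
# Average file size
Reducer_5(Size_Map)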

17458.765180179267
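
The three reduce steps above each make their own pass over the sizes. As a sketch of how they could be combined, the example below maps every numeric size to a `(min, max, sum, count)` tuple and merges the tuples pairwise, which is the kind of associative combine step that suits MapReduce. The sample values and the names `map_size` and `combine` are assumptions for illustration, and `-` is assumed to mark an entry with no size.

```python
from functools import reduce

# Sample Size values as strings; '-' is assumed to mark an entry with no size
sample_sizes = ['4179', '8083', '-', '0', '1349']

def map_size(value):
    """Map a numeric size to a (min, max, sum, count) tuple; return None otherwise."""
    if value.isdigit():
        n = int(value)
        return (n, n, n, 1)
    return None

def combine(a, b):
    """Merge two partial (min, max, sum, count) results."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2], a[3] + b[3])

# Drop the non-numeric entries, then fold the partial results together
partials = [p for p in map(map_size, sample_sizes) if p is not None]
smallest, biggest, total, count = reduce(combine, partials)
average = total / count
print(smallest, biggest, average)
```

Because `combine` does not care about the order in which partial results arrive, separate chunks of the log could be reduced independently and merged at the end.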

## The Challenge (part B)

By adding your own code in your own Jupyter Notebook cells below (you can add a cell by pressing the + button in the toolbar), try and answer some of the following questions about this data set:

- Which files were most popular in terms of `GET` requests?
- What day were the most HTTP requests made to the server?
- How many HTTP 200 (OK) responses were made?
- How many other HTTP code responses were made? Hint: here is a [list of HTTP response codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- What were the biggest, smallest and average file sizes served?

**Important: I want you to try and complete this exercise using the MapReduce programming model. If you find this too difficult, go ahead and use `pandas` anyway as this is still a very challenging lab.**

If you comfortably work out answers for all of these, feel free to add your own analyses!

When you're finished with lab 7 (or have completed what you can), choose **Save and Checkpoint** from the **File** menu, then choose **Download as Notebook** and save it to your computer or USB stick. You can then send a copy to the lecturer via Slack or email to check over.

In [214]:
#Most popular files among GET requests
df1.loc[:,['File','Method']].where(df1["Method"]=='GET').groupby('File').count().sort_values(by='Method',ascending=False).head(5)

Unnamed: 0_level_0,Method
File,Unnamed: 1_level_1
/images/NASA-logosmall.gif,6193
/images/KSC-logosmall.gif,4917
/images/MOSAIC-logosmall.gif,4216
/images/WORLD-logosmall.gif,4212
/images/USA-logosmall.gif,4210


In [86]:
df1['Date']=df1['DateTime'].str[0:11]
df1.head(5)

Unnamed: 0,Address,DateTime,Timezone,Method,File,HTTP,Status,Size,IP,IP Address,Country,Latitude,Longitude,City,Date
0,159.142.165.138,15/Aug/1995:11:03:22,-400,GET,/shuttle/missions/sts-73/sts-73-patch-small.gif,HTTP/1.0,200,4179,159.142.165.138,159.142.165.138,United States,38.8933,-77.0146,Washington,15/Aug/1995
1,134.131.38.18,22/Aug/1995:13:43:38,-400,GET,/shuttle/missions/sts-73/sts-73-patch-small.gif,HTTP/1.0,200,4179,134.131.38.18,134.131.38.18,United States,32.404,-86.2539,Montgomery,22/Aug/1995
2,os2c14.aca.ilstu.edu,31/Aug/1995:21:47:11,-400,GET,/shuttle/missions/sts-69/sts-69-patch-small.gif,HTTP/1.0,200,8083,92.242.132.15,92.242.132.15,United Kingdom,51.4964,-0.1224,,31/Aug/1995
3,suba01.suba.com,24/Aug/1995:04:48:23,-400,GET,/htbin/wais.pl?TISP,HTTP/1.0,200,1349,104.160.171.91,104.160.171.91,United States,34.0584,-118.278,Los Angeles,24/Aug/1995
4,146.138.145.170,08/Aug/1995:16:30:51,-400,GET,/shuttle/missions/sts-62/sts-62-patch-small.gif,HTTP/1.0,200,14385,146.138.145.170,146.138.145.170,United States,38.8933,-77.0146,Washington,08/Aug/1995


In [218]:
df1.groupby('Date').count().sort_values(by='Method',ascending=False)['Method'].head(5)

Date
31/Aug/1995    5694
30/Aug/1995    5106
29/Aug/1995    4430
10/Aug/1995    3908
14/Aug/1995    3891
Name: Method, dtype: int64

In [221]:
df1.where(df1['Status']=='200')['Status'].count()

88851

In [222]:
df1['Status'].unique()

array(['200', '304', '302', '404', '403', '501'], dtype=object)

In [235]:
df1['Size'].str.replace('-','0').astype(int).min()

0

In [237]:
df1['Size'].str.replace('-','0').astype(int).max()

3155499

In [238]:
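# '-' entries are counted as 0 here, which may explain why this mean is lower than the MapReduce result, where they are skipped
df1['Size'].str.replace('-','0').astype(int).mean()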

17301.58308269169

# Bonus Challenge

If you are feeling *really* adventurous, you can try using a Python library to do geographical-IP lookups to do some analyses. You will need to open up the command line and install the library called `geolite2`. To do this, open **Git Bash** and type the following:

```
$  pip install maxminddb-geolite2
```

Once `pip` has installed `geolite2`, restart Jupyter Notebook and you should be able to use it along the following lines:

In [35]:
from geolite2 import geolite2
reader = geolite2.reader()
reader.get('1.0.0.0')

{'city': {'geoname_id': 2151718, 'names': {'en': 'Research'}},
 'continent': {'code': 'OC',
  'geoname_id': 6255151,
  'names': {'de': 'Ozeanien',
   'en': 'Oceania',
   'es': 'Oceanía',
   'fr': 'Océanie',
   'ja': 'オセアニア',
   'pt-BR': 'Oceania',
   'ru': 'Океания',
   'zh-CN': '大洋洲'}},
 'country': {'geoname_id': 2077456,
  'iso_code': 'AU',
  'names': {'de': 'Australien',
   'en': 'Australia',
   'es': 'Australia',
   'fr': 'Australie',
   'ja': 'オーストラリア',
   'pt-BR': 'Austrália',
   'ru': 'Австралия',
   'zh-CN': '澳大利亚'}},
 'location': {'accuracy_radius': 1000,
  'latitude': -37.7,
  'longitude': 145.1833,
  'time_zone': 'Australia/Melbourne'},
 'postal': {'code': '3095'},
 'registered_country': {'geoname_id': 2077456,
  'iso_code': 'AU',
  'names': {'de': 'Australien',
   'en': 'Australia',
   'es': 'Australia',
   'fr': 'Australie',
   'ja': 'オーストラリア',
   'pt-BR': 'Austrália',
   'ru': 'Австралия',
   'zh-CN': '澳大利亚'}},
 'subdivisions': [{'geoname_id': 2145234,
   'iso_code': 'VIC

The `reader.get( ... )` line takes an IP address, looks up the geographical information about it, and returns a Python dictionary. You can then select specific geographical information about the IP address. For example:

In [2]:
# Get the country, in particular the English name
geo_dict = reader.get('1.1.1.1')
geo_dict['country']['names']['en']

'Australia'

In [3]:
# Get the continent, in particular the English name
geo_dict = reader.get('1.1.1.1')
geo_dict['continent']['names']['en']

'Oceania'
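
Chained lookups like the ones above raise an error as soon as one level is missing. As a hedged sketch, the helper below returns `None` instead. It assumes that a lookup can come back empty or without some sections (for example, no `city`), and it works on a trimmed-down dictionary in the shape shown earlier rather than on a live `reader`. The name `english_name` is my own.

```python
def english_name(geo_dict, section):
    """Return geo_dict[section]['names']['en'], or None if any level is missing."""
    if not geo_dict:
        return None
    return geo_dict.get(section, {}).get('names', {}).get('en')

# A trimmed-down dictionary in the shape of the reader.get output shown earlier
sample_geo = {
    'continent': {'code': 'OC', 'names': {'en': 'Oceania'}},
    'country': {'iso_code': 'AU', 'names': {'en': 'Australia'}},
}

print(english_name(sample_geo, 'country'))
print(english_name(sample_geo, 'city'))
print(english_name(None, 'country'))
```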

I have not tested this, so I will leave it to you to work out for yourselves if you take on this Bonus Challenge!

## Transform host names to IP addresses

In [12]:
import socket

In [39]:
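for row in df1.index.values:
    # Addresses that start with a digit are treated as IP addresses already
    if df1.loc[row,'Address'][0] in '0123456789':
        df1.loc[row,'IP Address']=df1.loc[row,'Address']
    else:
        try:
            df1.loc[row,'IP Address']=socket.gethostbyname(df1.loc[row,'Address'])
        except (OSError, UnicodeError):
            # The host name could not be resolved
            df1.loc[row,'IP Address']=None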
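
The loop above makes one DNS lookup per row, even when the same host appears many times. A possible refinement, sketched below, resolves each unique address once and keeps the answers in a dictionary. The helper names `is_ip_address` and `resolve_all` are hypothetical, and the `resolver` parameter exists only so the demo can run with a stand-in function instead of real network access.

```python
import ipaddress
import socket

def is_ip_address(text):
    """True if text parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False

def resolve_all(addresses, resolver=socket.gethostbyname):
    """Map each unique address to an IP string (or None), resolving each host once."""
    cache = {}
    for address in addresses:
        if address in cache:
            continue
        if is_ip_address(address):
            cache[address] = address
        else:
            try:
                cache[address] = resolver(address)
            except (OSError, UnicodeError):
                # The host name could not be resolved
                cache[address] = None
    return cache

# Demo with a stand-in resolver so the sketch runs without network access
def fake_resolver(host):
    return '192.0.2.1'

demo = resolve_all(['159.142.165.138', 'pizza.innet.net', 'pizza.innet.net'],
                   resolver=fake_resolver)
print(demo)
```

With a real DataFrame, the resulting dictionary could then be applied in one step, for example with `df1['Address'].map(...)`, instead of assigning row by row.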

In [40]:
df1.head(5)

Unnamed: 0,Address,DateTime,Timezone,Method,File,HTTP,Status,Size,IP,IP Address
0,159.142.165.138,15/Aug/1995:11:03:22,-400,GET,/shuttle/missions/sts-73/sts-73-patch-small.gif,HTTP/1.0,200,4179,159.142.165.138,159.142.165.138
1,134.131.38.18,22/Aug/1995:13:43:38,-400,GET,/shuttle/missions/sts-73/sts-73-patch-small.gif,HTTP/1.0,200,4179,134.131.38.18,134.131.38.18
2,os2c14.aca.ilstu.edu,31/Aug/1995:21:47:11,-400,GET,/shuttle/missions/sts-69/sts-69-patch-small.gif,HTTP/1.0,200,8083,92.242.132.15,92.242.132.15
3,suba01.suba.com,24/Aug/1995:04:48:23,-400,GET,/htbin/wais.pl?TISP,HTTP/1.0,200,1349,104.160.171.91,104.160.171.91
4,146.138.145.170,08/Aug/1995:16:30:51,-400,GET,/shuttle/missions/sts-62/sts-62-patch-small.gif,HTTP/1.0,200,14385,146.138.145.170,146.138.145.170


### Get geo_dict using geolite2

In [78]:
for row in df1.index.values:
    if pd.notnull(df1.loc[row,'IP Address']):
        IP=df1.loc[row,'IP Address']
        try:
            geo_dict = reader.get(IP)
            df1.loc[row,'Country']=geo_dict['country']['names']['en']
            try:
                df1.loc[row,'City']=geo_dict['city']['names']['en']
            except KeyError:
                # Some records have no city entry
                pass
            df1.loc[row,'Latitude']=geo_dict['location']['latitude']
            df1.loc[row,'Longitude']=geo_dict['location']['longitude']
        except (KeyError, TypeError, ValueError):
            # Skip addresses with no record or with missing fields
            pass

In [79]:
df1.head()

Unnamed: 0,Address,DateTime,Timezone,Method,File,HTTP,Status,Size,IP,IP Address,Country,Latitude,Longitude,City
0,159.142.165.138,15/Aug/1995:11:03:22,-400,GET,/shuttle/missions/sts-73/sts-73-patch-small.gif,HTTP/1.0,200,4179,159.142.165.138,159.142.165.138,United States,38.8933,-77.0146,Washington
1,134.131.38.18,22/Aug/1995:13:43:38,-400,GET,/shuttle/missions/sts-73/sts-73-patch-small.gif,HTTP/1.0,200,4179,134.131.38.18,134.131.38.18,United States,32.404,-86.2539,Montgomery
2,os2c14.aca.ilstu.edu,31/Aug/1995:21:47:11,-400,GET,/shuttle/missions/sts-69/sts-69-patch-small.gif,HTTP/1.0,200,8083,92.242.132.15,92.242.132.15,United Kingdom,51.4964,-0.1224,
3,suba01.suba.com,24/Aug/1995:04:48:23,-400,GET,/htbin/wais.pl?TISP,HTTP/1.0,200,1349,104.160.171.91,104.160.171.91,United States,34.0584,-118.278,Los Angeles
4,146.138.145.170,08/Aug/1995:16:30:51,-400,GET,/shuttle/missions/sts-62/sts-62-patch-small.gif,HTTP/1.0,200,14385,146.138.145.170,146.138.145.170,United States,38.8933,-77.0146,Washington


### Countries and cities with the most requests

In [91]:
df1.loc[:,['Status','Country']].groupby('Country').count().sort_values(by='Status',ascending=False).head(10)

Unnamed: 0_level_0,Status
Country,Unnamed: 1_level_1
United Kingdom,56230
United States,27489
Canada,1859
Japan,1151
Germany,809
Australia,641
Sweden,436
Italy,351
Brazil,245
Republic of Korea,230


In [95]:
df1.loc[:,['Status','City','Country']].groupby(['City','Country']).count().sort_values(by='Status',ascending=False).head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Status
City,Country,Unnamed: 2_level_1
Huntsville,United States,7080
New York,United States,665
Cocoa,United States,526
Chesterfield,United States,458
Ashburn,United States,456
Titusville,United States,453
Montgomery,United States,446
Chicago,United States,418
Washington,United States,381
Denver,United States,380


### Map showing the US cities with the most requests - dot size scales with the number of requests

In [141]:
USCities=df1.where(df1['Country']=='United States')
lat=USCities.loc[:,['Latitude','City']].groupby('City').mean().sort_index()['Latitude'].tolist()
lon=USCities.loc[:,['Longitude','City']].groupby('City').mean().sort_index()['Longitude'].tolist()
count_start=USCities.loc[:,['Longitude','City']].groupby('City').count().sort_index()['Longitude']
data_range=count_start.max()-count_start.min()
minimum=count_start.min()
count=(((count_start-minimum)/data_range)*20+4).tolist()
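
The last three lines of the cell above rescale the per-city counts linearly so that the smallest count gets a dot size of 4 and the largest a dot size of 24. A minimal sketch of the same formula as a plain function follows. The name `scale_sizes` and the guard for equal counts are my own additions, and the three counts are taken from the city table earlier purely as sample input.

```python
def scale_sizes(counts, low=4, span=20):
    """Rescale counts linearly so the smallest maps to low and the largest to low + span."""
    smallest, largest = min(counts), max(counts)
    data_range = largest - smallest
    if data_range == 0:
        # All counts equal: avoid dividing by zero and use the smallest dot size
        return [float(low)] * len(counts)
    return [(c - smallest) / data_range * span + low for c in counts]

# Three request counts from the city table, used only as sample input
sizes = scale_sizes([7080, 665, 380])
print(sizes)
```

One thing the sketch makes visible: with a single dominant city, linear scaling leaves almost every other dot near the minimum size.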

In [147]:
from bokeh.io import output_notebook, show
from bokeh.models import (
  GMapPlot, GMapOptions, ColumnDataSource, Circle, Range1d, PanTool, WheelZoomTool, BoxSelectTool
)

map_options = GMapOptions(lat=35, lng=-100, map_type="roadmap", zoom=4)

plot = GMapPlot(x_range=Range1d(), y_range=Range1d(), map_options=map_options)
plot.title.text = "America"

plot.api_key = ""

source = ColumnDataSource(data=dict(lat=lat,lon=lon,size=count))

circle = Circle(x="lon", y="lat", size="size", fill_color="blue", fill_alpha=0.8, line_color=None)
plot.add_glyph(source, circle)

plot.add_tools(PanTool(), WheelZoomTool(), BoxSelectTool())
output_notebook()
show(plot)

