# Introduction

This lesson introduces the following:
<ul>
    <li>Loading data from a file or database into DataFrames</li>
    <li>Combining data from multiple files into one source</li>
    <li>Using a map API to look up location data</li>
    <li>Plotting locations on a map given a lat/lon</li>
</ul>

This exercise uses fictional customer data to identify the location of your top customer.  You have a customer database with the following information: email, phone number, and last login IP address.  When customers log in to your website, a log is updated with the login IP and the login email.  You want to combine the information in your log with your database to identify your top customer.

This exercise will show you how to load data, clean up duplicates, and combine the data from the two sources.  Then you will use a map API to find the location of the phone number (by area code) and of the IP address.  Using this information, you will map the possible locations where your user lives or logs in.

# Functions

<font color="red">READ DIRECTIONS BELOW BEFORE PROCEEDING</font><br>
The following 6 cells contain functions for your use.  <br>
Execute the code in the cells but do <b>NOT</b> add or change anything since this could break the code!

<h3 style="color:blue;">Function 1</h3>

In [1]:
'''
Execute this code but do NOT modify!
'''
import numpy

def editDistance(str1, str2, thresh=0.7):
    '''
    Inputs:
    str1 = a string
    str2 = a string
    thresh = type float. this is an optional input.  
             it defines how close a match is required to determine str1 and str2 are similar 
    Function:
    This function compares how similar two strings are
    If the similarity ratio meets the threshold defined by thresh, True is returned
    '''
    m = len(str1)
    n = len(str2)
    lensum = float(m + n)
    d = []           
    for i in range(m+1):
        d.append([i])        
    del d[0][0]    
    for j in range(n+1):
        d[0].append(j)       
    for j in range(1,n+1):
        for i in range(1,m+1):
            if str1[i-1] == str2[j-1]:
                d[i].insert(j,d[i-1][j-1])           
            else:
                minimum = min(d[i-1][j]+1, d[i][j-1]+1, d[i-1][j-1]+2)         
                d[i].insert(j, minimum)
    ldist = d[-1][-1]
    ratio = (lensum - ldist)/lensum
    
    if ratio >= thresh:
        return True
    else:
        return False
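
To get a feel for the threshold, it can help to look at the ratio itself rather than only the True/False result. The sketch below is not part of the exercise: `similarity_ratio` is a hypothetical helper that mirrors the arithmetic of editDistance (insertions and deletions cost 1, substitutions cost 2) and returns the number that editDistance compares against thresh.

```python
def similarity_ratio(a, b):
    """Return (len(a) + len(b) - distance) / (len(a) + len(b)).

    The distance charges 1 for an insertion or deletion and 2 for a substitution.
    """
    m, n = len(a), len(b)
    if m + n == 0:
        return 1.0  # two empty strings are treated as identical
    prev = list(range(n + 1))  # distances from the empty prefix of a
    for i in range(1, m + 1):
        curr = [i]
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                curr.append(prev[j - 1])
            else:
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + 2))
        prev = curr
    return (m + n - prev[n]) / (m + n)

# Two case changes count as two substitutions: (20 - 4) / 20
print(similarity_ratio("DarthVader", "darthvader"))            # 0.8
# One inserted "." costs 1: (21 - 1) / 21
print(round(similarity_ratio("DarthVader", "Darth.Vader"), 3))  # 0.952
```

Both ratios are above the default thresh of 0.7, which is why such variants are treated as matches.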

<h3 style="color:blue;">Function 2</h3>

In [2]:
'''
Execute this code but do NOT modify!
'''

def get_location(selector):
    '''
    Inputs:
    selector = phone or ip
    Function:
    This function returns a city and state and description of selector type
    To simplify this exercise (and because we use phony values) we assume we know the city and state
    More advanced code would look up the location of the area codes and IP addresses
    '''
    if selector == 123456789:
        city = "los angeles"
        state = "california"
    else:
        city = "washington"
        state = "district of columbia"
    if "." in str(selector):
        desc = "ip"
    else:
        desc = "phone"
    return city, state, desc

<h3 style="color:blue;">Function 3</h3>

In [3]:
'''
Execute this code but do NOT modify!
'''

import requests
import json
import time

def latlon(city, state):
    '''
    Inputs:
    city = name of a US city
    state = name of the US state that contains the city
    Function:
    This function returns a latitude and longitude for a US city,state pair
    '''
    
    base = "https://nominatim.openstreetmap.org/search?q="

    params = "{city}+{state}+united states&format=json&limit=1".format(city=city, state=state)

    url = "{base}{params}".format(base=base, params=params)
    
    response = requests.get(url)
    
    if response.status_code == requests.codes.ok:
        txt = response.text
        obj = json.loads(txt)
        
        # an empty result list means no match (obj[0] would raise IndexError)
        if obj:
            lat = obj[0]['lat']
            lon = obj[0]['lon']
        else:
            lat = 0.0
            lon = 0.0

        return lat, lon
    
    else:
        print("Error", response.status_code)
        
        return 0,0
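
The two fiddly parts of latlon are building the query string and reading the JSON that comes back. The sketch below isolates both so they can be tried without a network call. `build_search_url` and `parse_latlon` are hypothetical helper names, and `sample` is a trimmed, illustrative response body that reuses the Washington coordinates shown later in this notebook; a real response contains more fields.

```python
import json
from urllib.parse import urlencode

BASE = "https://nominatim.openstreetmap.org/search"

def build_search_url(city, state):
    """Build the search URL, letting urlencode escape spaces and punctuation."""
    query = urlencode({"q": f"{city} {state} united states", "format": "json", "limit": 1})
    return f"{BASE}?{query}"

def parse_latlon(response_text):
    """Return (lat, lon) as floats, or (0.0, 0.0) when the result list is empty."""
    results = json.loads(response_text)
    if not results:
        return 0.0, 0.0
    # the service reports coordinates as strings, so convert them
    return float(results[0]["lat"]), float(results[0]["lon"])

sample = '[{"lat": "38.8948932", "lon": "-77.0365529"}]'  # illustrative response body

print(build_search_url("los angeles", "california"))
print(parse_latlon(sample))
print(parse_latlon("[]"))  # no match
```

Converting to float here is a design choice of the sketch; the latlon function above returns the values exactly as the service sends them.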

<h3 style="color:blue;">Function 4</h3>

In [4]:
'''
Execute this code but do NOT modify!
'''

import plotly as py
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

def make_map(df):
    '''
    Inputs:
    df = dataframe with columns for selector, lat, lon
    Function:
    This function generates a map
    '''
    
    scl = [ [0,"rgb(5, 10, 172)"],[0.35,"rgb(40, 60, 190)"],[0.5,"rgb(70, 100, 245)"],\
        [0.6,"rgb(90, 120, 245)"],[0.7,"rgb(106, 137, 247)"],[1,"rgb(220, 220, 220)"] ]

    data = [ dict(
            type = 'scattergeo',
            lon = df[df.columns[4]],
            lat = df[df.columns[3]],
            text = df[df.columns[2]],
            mode = 'markers',
            marker = dict(
                size = 8,
                opacity = 0.8,
                reversescale = True,
                autocolorscale = False,
                symbol = 'square',
                line = dict(
                    width=1,
                    color='rgb(102, 102, 102)'
                ),

            ))]

    layout = dict(
            title = 'Locations<br>(Hover for information)',
            colorbar = False,
            geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showland = True,
            landcolor = "rgb(250, 250, 250)",
            subunitcolor = "rgb(217, 217, 217)",
            countrycolor = "rgb(217, 217, 217)",
            countrywidth = 0.5,
            subunitwidth = 0.5
           ),
        )

    fig = dict( data=data, layout=layout )
    init_notebook_mode(connected=True)
    iplot( fig, validate=False, filename='loc-map' )
   
    return


<h3 style="color:blue;">Function 5</h3>

In [5]:
import json

def df_to_geojson(df, properties, dataset, output_name, lat='latitude', lon='longitude'):
    """
    Turn a dataframe containing point data into a geojson formatted python dictionary
    
    df : the dataframe to convert to geojson
    properties : a list of columns in the dataframe to turn into geojson feature properties
    dataset : a variable name that describes the contents of the properties
    output_name : name of output file
    lat : the name of the column in the dataframe that contains latitude data
    lon : the name of the column in the dataframe that contains longitude data
    """
    
    # create a new python dict to contain our geojson data, using geojson format
    geojson = {'type':'FeatureCollection', 'features':[]}

    # loop through each row in the dataframe and convert each row to geojson format
    for _, row in df.iterrows():
        # create a feature template to fill in
        feature = {'type':'Feature',
                   'properties':{},
                   'geometry':{'type':'Point',
                               'coordinates':[]}}

        # fill in the coordinates
        feature['geometry']['coordinates'] = [row[lon],row[lat]]

        # for each column, get the value and add it as a new feature property
        for prop in properties:
            feature['properties'][prop] = row[prop]
        
        # add this feature (aka, converted dataframe row) to the list of features inside our dict
        geojson['features'].append(feature)
    
    geojson_dict = geojson
    geojson_str = json.dumps(geojson_dict, indent=2)

    # save the geojson result to a file
    output_filename = output_name
    with open(output_filename, 'w') as output_file:
        output_file.write('var {0} = {1};'.format(dataset, geojson_str))

    # how many features did we save to the geojson file?
    print('{0} geotagged features saved to file {1}'.format(len(geojson_dict['features']),output_filename))
    
    return 
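
For reference, this is roughly what one converted row looks like. The sketch is standalone and uses only the standard library; `point_feature` is a hypothetical helper, and the property values are taken from the example rows shown later in this notebook. The detail worth noticing is the coordinate order: GeoJSON points are written as [longitude, latitude], the reverse of how most people say them.

```python
import json

def point_feature(lon, lat, properties):
    """Build one GeoJSON Point feature; coordinates are ordered [longitude, latitude]."""
    return {
        "type": "Feature",
        "properties": dict(properties),
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
    }

feature = point_feature(-77.0365529, 38.8948932, {"city": "washington", "description": "ip"})
collection = {"type": "FeatureCollection", "features": [feature]}

# Same serialization call that df_to_geojson uses before writing the file
print(json.dumps(collection, indent=2))
```

The sketch passes numbers for the coordinates. If your latitude and longitude columns hold strings, converting them with float() first may be worth considering, since GeoJSON positions are numeric.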

<h3 style="color:blue;">Function 6</h3>

In [6]:
import os

def combine_files(file1, file2):
    '''
    Combine two files into one, rename new file markers.js and delete the old.
    Inputs:
    - file1 = name of first file
    - file2 = name of second file
    '''
    #combine files
    f_in = open(file1, "r")
    data2 = f_in.read()
    f_in.close()
    f_out = open(file2, "a")
    f_out.write("\n"*2)
    f_out.write(data2)
    f_out.close()
    
    #rename combined file and delete other
    if os.path.isfile("./markers.js"):
        os.remove("markers.js")
        
    os.rename(file2, "markers.js")
    os.remove(file1)
    
    print("files combined to markers.js")
    
    return

# Student Exercise Starts Here

<b>Step 1</b>

You will need to use the Python package called os. <br>
This will let you interact with the operating system 
to look at files in your directory and get the current
working directory.<br>
<br>
Hint: use <b>import</b> to access the code in os.

In [7]:
import os

<b>Step 2</b>

listdir() is a function in os that lists the files in your current directory.  Use dot notation to call it and list all your files. <br> Make sure you see the following:
<ul>
<li>mbox.txt</li>
<li>Email_Database.xlsx</li>
<li>plotly</li>
<li>plotly.egg-info</li>
<li>template.html</li>
</ul>

In [8]:
os.listdir()

['plotly',
 '.ipynb_checkpoints',
 'markers.js',
 'template.html',
 'plotly.egg-info',
 'icons',
 'Email_Database.xlsx',
 'mbox.txt',
 'Maps_and_APIs.ipynb',
 'js']
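
Rather than comparing the listing by eye, you could let Python report what is missing. This is an optional sketch, not part of the exercise: `missing_files` is a hypothetical helper, and the demonstration runs in a temporary directory so it does not depend on your own files.

```python
import os
import tempfile

REQUIRED = ["mbox.txt", "Email_Database.xlsx", "plotly", "plotly.egg-info", "template.html"]

def missing_files(required, directory="."):
    """Return the names in `required` that os.listdir() does not report for `directory`."""
    present = set(os.listdir(directory))
    return [name for name in required if name not in present]

# Demonstration in a throwaway directory that holds only one of the expected files
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "mbox.txt"), "w").close()
    demo_missing = missing_files(REQUIRED, tmp)
    print(demo_missing)
```

In the notebook itself, calling `missing_files(REQUIRED)` with no directory checks the current working directory; an empty list means everything is in place.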

<b>Step 3</b>

getcwd() is a function in os that returns the working directory.  Use this to set a variable called cwd to the current working directory, then print out cwd to verify you did it right.  On Windows, notice the double backslashes (\\) used to separate folder and file names; on macOS and Linux the separator is a forward slash (/).

In [9]:
cwd = os.getcwd()
cwd

<b>Step 4</b>

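Create a variable called textFile. <br>
textFile should be a string.  Combine cwd (defined in step 3) with mbox.txt.  <br>Hint: Remember the path separator (a double backslash on Windows, a forward slash on macOS and Linux), or let os.path.join() add it for you.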

Print textFile to make sure the path to the file looks correct.

In [10]:
textFile = os.path.join(cwd, "mbox.txt")
textFile

<b>Step 5</b>

You will need to use a pandas DataFrame in the steps that follow.  Import pandas.  Use "as" to import it with the short name pd.

In [11]:
import pandas as pd

<b>Step 6</b>

Define a DataFrame called df.  To do this you will need to use pd and DataFrame.  Don't forget the dot notation.<br>
You should also specify a column named email.  

Print the dataframe to verify you have an empty dataframe with one column called email.

In [12]:
df = pd.DataFrame(columns=["email"])
df

Unnamed: 0,email


<b>Step 7</b>

Use "with open" and .readlines() to open and read all the lines in textFile.<br>
Iterate through the lines you read using "for". <br>
Look for the email addresses of senders in the text file. <br>
Do this by using startswith() to check whether the line begins with "From". <br>
If it does, split the line into words using split(), then grab the email address listed right after "From".<br>
Store the email in the DataFrame, df, that you created in the step above.

Use df.head() to make sure you did everything correctly.

In [13]:
with open(textFile,"r") as file:
    lines = file.readlines()

for line in lines:
    if line.startswith("From"):
        words = line.split()
        email = words[1]
        df.loc[df.index.size] = email
        
df.head()

Unnamed: 0,email
0,DaForce@uct.ac.za
1,DaForce@uct.ac.za
2,Rogue1@media.berkeley.edu
3,Rogue1@media.berkeley.edu
4,DarthVader@umich.edu
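
Each address appears twice in the output above. One plausible explanation, assuming mbox.txt follows the usual mbox layout, is that every message has both an envelope line beginning with "From " and a header line beginning with "From:", and startswith("From") matches both. The sketch below shows the effect on a few invented sample lines; `extract_senders` is a hypothetical helper, not part of the exercise.

```python
def extract_senders(lines, prefix="From"):
    """Collect the word that follows `prefix` on every line starting with it."""
    senders = []
    for line in lines:
        if line.startswith(prefix):
            words = line.split()
            if len(words) > 1:  # skip a bare "From" line with nothing after it
                senders.append(words[1])
    return senders

# Invented lines in mbox style, for illustration only
sample_lines = [
    "From DarthVader@umich.edu Sat Jan  5 09:14:16 2008\n",
    "Return-Path: <postmaster@example.org>\n",
    "From: DarthVader@umich.edu\n",
    "Subject: status report\n",
]

print(extract_senders(sample_lines))           # both "From " and "From:" lines match
print(extract_senders(sample_lines, "From "))  # trailing space: only the envelope line matches
```

Double counting every sender does not change who ranks first, so the exercise still works; it only matters if you care about the absolute counts.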


<b>Step 8</b>

Use the pandas function value_counts() and sort=True to find the top emails in df. <br>
Store the counts in a variable named freq. <br>
value_counts() takes an array as an argument. <br>
You will need to use df["email"].values to generate the necessary array of emails.

When you're done print out freq.head() to see the results.

In [14]:
freq = pd.value_counts(df["email"].values, sort=True)
freq.head()

DarthVader@umich.edu       390
skywalker@indiana.edu      322
princess.leia@iupui.edu    316
Ewok@iupui.edu             222
C-3PO@vt.edu               220
dtype: int64
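
An alternative worth knowing: a pandas Series has its own value_counts() method, so the same count can be taken directly on the column. Depending on your pandas version, the top-level pd.value_counts() used above may print a deprecation warning, in which case the method form is the usual replacement. The sketch uses a small invented Series rather than the exercise data.

```python
import pandas as pd

# Invented addresses, for illustration only
emails = pd.Series(
    ["a@example.org", "b@example.org", "a@example.org",
     "a@example.org", "b@example.org", "c@example.org"],
    name="email",
)

freq = emails.value_counts()  # sorted with the most frequent value first by default
top_email = freq.index[0]     # the label of the first row is the most common address

print(freq)
print(top_email)
```

With the exercise data, the equivalent call would be `df["email"].value_counts()`.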

<b>Step 9</b>

To grab the top email you will need to use freq.index to list the emails.  Then use .values[0] which will grab the first email listed in freq.index. <br>
Name your new variable top_email.

Print out top_email to make sure it corresponds to DarthVader.


In [15]:
top_email = freq.index.values[0]
top_email

'DarthVader@umich.edu'

<b>Step 10</b>

Open the Excel spreadsheet Email_Database.xlsx and read it into a DataFrame.<br>
Use the pandas function read_excel() to do this. <br>
read_excel() takes the file name as the only argument you need. <br>
Name your new DataFrame df_emails.

Use df_emails.head() to make sure it loaded right.


In [16]:
df_emails = pd.read_excel("Email_Database.xlsx")
df_emails.head()

Unnamed: 0,emails,phones,ips
0,AEuxDlmpAwxt,992647448936,290.52.283.8
1,AHfsnyRYiuTz,327480075452,352.77.713.4
2,AKmWMyZVAPIU,378132734350,366.62.777.7
3,ATvFStKMNVGe,177381034352,789.89.588.0
4,AWSZlaunHhEj,223418126068,711.00.456.8


<b>Step 11</b>

Notice that the email database you just loaded in the previous step only contains the username part of the email. <br>
We need to normalize top_email.  First split the email address on "@".  

Print top_email to see what it looks like.

In [17]:
top_email = top_email.split("@")
top_email

['DarthVader', 'umich.edu']

<b>Step 12</b>

Notice the username is the first element (index 0) of the list you made in the previous step.<br>
Set top_email equal to this.

Print it out to make sure everything looks right.

In [18]:
top_email = top_email[0]
top_email

'DarthVader'
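
Steps 11 and 12 can also be written as a single expression. The sketch below is one way to do it; `username` is a hypothetical helper name. Passing 1 as the second argument to split() limits it to one split, so only the first "@" is used as the separator.

```python
def username(email):
    """Return the part before the first "@" (the whole string if there is no "@")."""
    return email.split("@", 1)[0]

print(username("DarthVader@umich.edu"))  # DarthVader
```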

<b>Step 13</b>

Repurpose the DataFrame df.  Re-initialize it with columns = ['email','number']. <br>
Next iterate through df_emails using "for" and "iterrows()". <br>
Grab the email from the emails column of each row.  <br>
Use the function editDistance (defined as Function 1 at the top of this notebook) to determine if an email is the same as or similar to top_email. <br>
If editDistance returns True, add two rows to df: one pairing the email with its phone and one pairing it with its ip.

Print df when you're done to make sure it looks right.

In [19]:
df = pd.DataFrame(columns=['email','number'])

for idx, row in df_emails.iterrows():
    email = row["emails"]
    
    if editDistance(top_email, email):
        # one row for the phone number and one for the ip address
        df.loc[df.index.size] = [email, row["phones"]]
        df.loc[df.index.size] = [email, row["ips"]]
        
df

Unnamed: 0,email,number
0,Darth.Vader,123456789
1,Darth.Vader,172.16.254.1
2,DarthVader,123456789
3,DarthVader,172.16.254.1
4,DarthVader1,123456789
5,DarthVader1,172.16.254.1
6,darthvader,123456789
7,darthvader,172.16.254.1


<b>Step 14</b>

Create five new columns in df: 'description', 'latitude', 'longitude', 'city', and 'state'. <br>
To do this, define a one-element list with brackets, [None], and multiply it by df.index.size. <br>
Note df.index.size = number of rows in df.  You are basically generating empty columns that you can then populate. <br>
Iterate through df using "for" and "iterrows()".  As you iterate, use the functions get_location and latlon (these functions are defined at the top of the notebook) to get the location of each selector in the number column. <br>
Add the city, state, and description returned by get_location, and the latitude and longitude returned by latlon, to df.

Print df when you're done to make sure the new columns populated correctly.

In [20]:
df['description'] = [None]*df.index.size
df['latitude'] = [None]*df.index.size
df['longitude'] = [None]*df.index.size
df['city'] = [None]*df.index.size
df['state'] = [None]*df.index.size

for idx, row in df.iterrows():
   
    city, state, desc = get_location(row["number"])
    # df.loc[idx, column] writes into df itself; df.loc[idx][column] may only change a temporary copy
    df.loc[idx, "city"] = city
    df.loc[idx, "state"] = state
    df.loc[idx, "description"] = desc
    
    lat,lon = latlon(city,state)
    df.loc[idx, "latitude"] = lat
    df.loc[idx, "longitude"] = lon
    
df 

Unnamed: 0,email,number,description,latitude,longitude,city,state
0,Darth.Vader,123456789,phone,34.0536909,-118.2427666,los angeles,california
1,Darth.Vader,172.16.254.1,ip,38.8948932,-77.0365529,washington,district of columbia
2,DarthVader,123456789,phone,34.0536909,-118.2427666,los angeles,california
3,DarthVader,172.16.254.1,ip,38.8948932,-77.0365529,washington,district of columbia
4,DarthVader1,123456789,phone,34.0536909,-118.2427666,los angeles,california
5,DarthVader1,172.16.254.1,ip,38.8948932,-77.0365529,washington,district of columbia
6,darthvader,123456789,phone,34.0536909,-118.2427666,los angeles,california
7,darthvader,172.16.254.1,ip,38.8948932,-77.0365529,washington,district of columbia
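
The table above has eight rows but only two distinct places, and the loop in this step asks the map service for each row separately. One way to avoid repeated requests is to remember earlier answers. The sketch below uses functools.lru_cache from the standard library; `fake_geocode` is a hypothetical stand-in for latlon that returns fixed coordinates, so the example runs without network access.

```python
from functools import lru_cache

calls = []  # records every uncached lookup so the saving is visible

def fake_geocode(city, state):
    """Hypothetical stand-in for latlon(); returns fixed coordinates, no network access."""
    calls.append((city, state))
    table = {
        ("los angeles", "california"): (34.0536909, -118.2427666),
        ("washington", "district of columbia"): (38.8948932, -77.0365529),
    }
    return table.get((city, state), (0.0, 0.0))

@lru_cache(maxsize=None)
def cached_geocode(city, state):
    """Look up each distinct (city, state) pair only once."""
    return fake_geocode(city, state)

# Same pattern as the table above: two places repeated across eight rows
places = [("los angeles", "california"), ("washington", "district of columbia")] * 4
coords = [cached_geocode(city, state) for city, state in places]

print(len(coords), "rows,", len(calls), "lookups")  # 8 rows, 2 lookups
```

In the notebook, the same decorator could in principle be placed around a thin wrapper that calls latlon; that wrapper is left as an assumption here since the function cells are meant to stay unchanged.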


In [21]:
non_latlon_columns = ['email', 'number', 'description', 'city', 'state'] 

In [22]:
df_phones = df[df["description"] == "phone"]
df_phones

Unnamed: 0,email,number,description,latitude,longitude,city,state
0,Darth.Vader,123456789,phone,34.0536909,-118.2427666,los angeles,california
2,DarthVader,123456789,phone,34.0536909,-118.2427666,los angeles,california
4,DarthVader1,123456789,phone,34.0536909,-118.2427666,los angeles,california
6,darthvader,123456789,phone,34.0536909,-118.2427666,los angeles,california


In [23]:
df_ips = df[df["description"] == "ip"]
df_ips

Unnamed: 0,email,number,description,latitude,longitude,city,state
1,Darth.Vader,172.16.254.1,ip,38.8948932,-77.0365529,washington,district of columbia
3,DarthVader,172.16.254.1,ip,38.8948932,-77.0365529,washington,district of columbia
5,DarthVader1,172.16.254.1,ip,38.8948932,-77.0365529,washington,district of columbia
7,darthvader,172.16.254.1,ip,38.8948932,-77.0365529,washington,district of columbia


In [24]:
df_to_geojson(df_phones, properties=non_latlon_columns, dataset="phones", output_name="phones.js")

4 geotagged features saved to file phones.js


In [25]:
df_to_geojson(df_ips, properties=non_latlon_columns, dataset="ips",output_name="ips.js")

4 geotagged features saved to file ips.js


In [26]:
combine_files(file1="phones.js",file2="ips.js")

files combined to markers.js


<b>Step 15</b>

Use the function make_map (defined at top of notebook) to generate a map.  The only argument you need is your DataFrame, df.

Play with the map: zoom, rotate, hover.

Then, for a nicer map, open template.html in your browser.

In [27]:
make_map(df)

In [28]:
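import webbrowser
from pathlib import Path

# use an absolute file:// URI; a bare relative file name may not open in every browser
url = Path("template.html").resolve().as_uri()
_ = webbrowser.open(url)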