# Women Who Code :: Build-Your-Own Dataset

Data scientists are sometimes handed a fully prepared and cleaned dataset, but this is rarely the case. Today's workshop will give you practice building your own dataset from scratch. We will use public APIs and publicly available data files to create a dataset of weather and population data that is ready for downstream use.

In this workshop, we'll be collecting and organizing information for fictional visitors to our fictional website.


In [1]:
from faker import Faker
import pandas as pd
import requests

# Part 1 :: Calling Public APIs

In this first section we will use several publicly available APIs to collect information about fictional visitors to our website. The only information we collect directly from visitors is their IP address. For anything beyond that, we'll have to pull in information from outside sources.

## Get your IP address

An IP address is a unique address that identifies a device on the internet or a local network. IP stands for "Internet Protocol," which is the set of rules governing the format of data sent via the internet or local network.

To find out your own IP address, you can make a call to the [ipify](https://www.ipify.org/) API, a simple public IP address API. This API does not require an account or API key.

In [2]:
# api endpoint
url = "https://api.ipify.org"

# request formatted response
params = {
    "format": "json"
}

resp = requests.get(url, params)

Congrats, you just made a call to the first API of this workshop! Let's take a closer look at the response, and see what information we've collected from it.

In [3]:
# http response status codes indicate whether the request has been successfully completed
resp.status_code

200
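Status codes are grouped by their first digit, so it can help to translate a code into something readable while exploring an API. The helper below is a minimal sketch using only the standard library; `describe_status` is a hypothetical name introduced for illustration, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short, human-readable description of an HTTP status code."""
    # The first digit of the code identifies its general category
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    category = categories.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the standard codes known to the stdlib
        phrase = "Unrecognized"
    return f"{code} {phrase} ({category})"


print(describe_status(200))
print(describe_status(404))
```

In the notebook this could be called as `describe_status(resp.status_code)`.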

In [4]:
# can get response in a string format
resp.text

'{"ip":"69.136.161.247"}'

In [5]:
# or more usefully as a json dictionary
resp.json()

{'ip': '69.136.161.247'}

In [6]:
# let's hold on to ip address, and use it in some next steps
ip = resp.json()["ip"]

## Get location for an IP address

IP addresses can be linked to information about the location where you are connected to the internet.

To find geolocation information given an IP address, we can use the [ip-api](https://ip-api.com/) JSON endpoint. The ipify endpoint let us pass the desired response format (JSON) as a query parameter; ip-api instead has a dedicated JSON endpoint, so we'll specify the data format as part of the URL path.
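The two ways of asking for JSON can be compared side by side. This sketch only builds the URL strings with the standard library and makes no network calls; the query-style URL is roughly what `requests` assembles from `params`, though the exact string it produces may differ slightly.

```python
from urllib.parse import urlencode

# Sample address taken from the ipify response shown earlier
ip_example = "69.136.161.247"

# Style 1 (ipify): the response format is a query parameter after the "?"
query_style = "https://api.ipify.org" + "?" + urlencode({"format": "json"})

# Style 2 (ip-api): the response format is a segment of the URL path
path_style = f"http://ip-api.com/json/{ip_example}"

print(query_style)
print(path_style)
```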

In [7]:
url = f"http://ip-api.com/json/{ip}"

resp = requests.get(url)

In [8]:
# check that the call succeeded
resp.status_code

200

In [9]:
# inspect the returned data
resp.json()

{'status': 'success',
 'country': 'United States',
 'countryCode': 'US',
 'region': 'IL',
 'regionName': 'Illinois',
 'city': 'Chicago',
 'zip': '60610',
 'lat': 41.9032,
 'lon': -87.6383,
 'timezone': 'America/Chicago',
 'isp': 'Comcast Cable Communications, LLC',
 'org': 'Comcast Cable Communications, Inc.',
 'as': 'AS7922 Comcast Cable Communications, LLC',
 'query': '69.136.161.247'}

In [10]:
# pull out the lat/long fields, since we can look up info about this location

lat = resp.json()["lat"]
long = resp.json()["lon"]

In [11]:
def get_location_info(ip: str):
    """
    Given an IP address, return a dictionary of location information.
    """
    url = f"http://ip-api.com/json/{ip}"
    resp = requests.get(url)
    return resp.json()
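The sample response above includes a `status` field, which suggests that a lookup can come back without usable coordinates. The sketch below checks that field before reading `lat` and `lon`. `extract_coordinates` is a hypothetical helper, and the shape of the failed response is an assumption made for illustration.

```python
from typing import Optional, Tuple


def extract_coordinates(location_info: dict) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) from a location dictionary, or None if unavailable."""
    # Only trust the payload when the lookup reports success
    if location_info.get("status") != "success":
        return None
    lat = location_info.get("lat")
    lon = location_info.get("lon")
    if lat is None or lon is None:
        return None
    return lat, lon


ok = {"status": "success", "lat": 41.9032, "lon": -87.6383}
# Assumed shape of an unsuccessful lookup, for illustration only
failed = {"status": "fail", "query": "192.168.0.1"}

print(extract_coordinates(ok))
print(extract_coordinates(failed))
```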

## Get local weather

Now that we know where a visitor is located, we can pull in information from other sources about that location. For this workshop, let's say that our website has weather-related content, so the visitor's current weather is of interest to us.

Given the visitor's latitude and longitude, we can use the [Open Meteo](https://open-meteo.com/en) API to get information about the location's current weather. Like the other APIs we've used in this workshop, Open Meteo is public and does not require an API key.



In [12]:
url = "https://api.open-meteo.com/v1/forecast"

params = {
    "latitude": lat,
    "longitude": long,
    "current_weather": True,
    "format": "json"
}

resp = requests.get(url, params)

In [13]:
resp.status_code

200

In [14]:
resp.json()

{'elevation': 182.125,
 'generationtime_ms': 0.21398067474365234,
 'current_weather': {'temperature': 12.5,
  'winddirection': 261,
  'time': '2022-04-21T15:00',
  'windspeed': 14.8,
  'weathercode': 2},
 'latitude': 41.875,
 'longitude': -87.625,
 'utc_offset_seconds': 0}
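The `weathercode` and `temperature` fields are easier to read once decoded. The sketch below assumes that `weathercode` follows the WMO weather interpretation codes and that temperatures are in degrees Celsius, which is my understanding of Open Meteo's defaults. The lookup table is deliberately partial, and `summarize_current_weather` is a hypothetical helper.

```python
# Partial mapping of WMO weather interpretation codes (assumed, not exhaustive)
WEATHER_CODES = {
    0: "Clear sky",
    1: "Mainly clear",
    2: "Partly cloudy",
    3: "Overcast",
    45: "Fog",
    61: "Slight rain",
    71: "Slight snow fall",
    95: "Thunderstorm",
}


def summarize_current_weather(payload: dict) -> str:
    """Build a one-line summary from a forecast response dictionary."""
    current = payload["current_weather"]
    description = WEATHER_CODES.get(current["weathercode"], "Unknown code")
    temp_c = current["temperature"]
    # Convert Celsius to Fahrenheit for readers used to either unit
    temp_f = temp_c * 9 / 5 + 32
    return f"{description}, {temp_c:.1f} °C ({temp_f:.1f} °F)"


# Values copied from the sample response above
sample = {"current_weather": {"temperature": 12.5, "weathercode": 2}}
print(summarize_current_weather(sample))
```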

In [15]:
def get_weather_info(lat: float, long: float):
    """
    Given a latitude and longitude, return the current weather.
    """
    url = "https://api.open-meteo.com/v1/forecast"

    params = {
        "latitude": lat,
        "longitude": long,
        "current_weather": True,
        "format": "json"
    }

    resp = requests.get(url, params)
    return resp.json()
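Network calls can fail for temporary reasons, so it may be worth retrying a few times before giving up. The helper below is a generic sketch, not something `requests` provides; `call_with_retries` and its parameters are hypothetical names chosen for this example.

```python
import time


def call_with_retries(func, *args, attempts=3, delay_seconds=1.0, **kwargs):
    """Call func, retrying on any exception with a growing pause between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Re-raise the error once the final attempt has failed
            if attempt == attempts:
                raise
            # Wait a little longer after each failed attempt
            time.sleep(delay_seconds * attempt)
```

With the functions defined in this notebook, a call might look like `call_with_retries(get_weather_info, lat, long)`.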


## Generate fake IP addresses

Since we are working with fake data, we'll have to create some fake IP addresses for our website visitors. To generate them, we'll use the Faker library.

In [16]:
# example of how Faker works
faker = Faker()
faker.ipv4() 


'88.166.86.65'

In [17]:
# make a list of all our IP addresses
fake_ips = [faker.ipv4() for _ in range(100)]
fake_ips[0:5]

['108.214.245.99',
 '207.144.42.212',
 '95.39.255.178',
 '39.13.26.200',
 '119.187.227.137']
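Randomly generated addresses may fall in private or reserved ranges, for which a geolocation lookup is unlikely to return coordinates. As a sketch, the standard library's `ipaddress` module can screen those out before any API calls are made. `is_public_ipv4` is a hypothetical helper, and the candidate list is made up for illustration.

```python
import ipaddress


def is_public_ipv4(ip: str) -> bool:
    """Return True if ip is a valid, globally routable IPv4 address."""
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        # Not a valid IP address string at all
        return False
    return address.version == 4 and address.is_global


candidates = ["108.214.245.99", "192.168.1.10", "10.0.0.5", "127.0.0.1", "not-an-ip"]
public_only = [ip for ip in candidates if is_public_ipv4(ip)]
print(public_only)
```

The same filter could be applied to the notebook's list with `[ip for ip in fake_ips if is_public_ipv4(ip)]`.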

In [18]:
def get_geo_weather_data(ip: str):
    
    # get location info for the ip address
    location_info = get_location_info(ip)
    
    # get the current weather at the lat/long
    weather_info = get_weather_info(location_info["lat"], location_info["lon"])
    
    # stack the dictionaries
    # the | merge operator only works in python 3.9 and above
    all_data = {"ip": ip} | location_info | weather_info
    
    # alternative for lower python versions using unpacking
    # all_data = {"ip": ip, **location_info, **weather_info}
    
    return all_data
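When dictionaries are stacked this way and two of them share a key, the value from the later dictionary wins. The sketch below forces a collision to show this; the `status` key in `weather_info` is made up for the example and is not something the weather API is known to return. It uses unpacking so that it also runs on Python versions below 3.9.

```python
location_info = {"status": "success", "country": "United States", "lat": 41.9032}
# "status" is added here only to create a key collision for illustration
weather_info = {"latitude": 41.875, "status": "ok"}

merged = {"ip": "69.136.161.247", **location_info, **weather_info}

# The later dictionary overwrites the earlier value for the shared key
print(merged["status"])
# Keys that differ ("lat" vs "latitude") are both kept
print(list(merged))
```

In this notebook the location and weather dictionaries use different key names (`lat`/`lon` versus `latitude`/`longitude`), so both sets of coordinates are kept in the merged record.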

In [19]:
get_geo_weather_data(fake_ips[0])

{'ip': '108.214.245.99',
 'status': 'success',
 'country': 'United States',
 'countryCode': 'US',
 'region': 'FL',
 'regionName': 'Florida',
 'city': 'Orlando',
 'zip': '32801',
 'lat': 28.5436,
 'lon': -81.3738,
 'timezone': 'America/New_York',
 'isp': 'AT&T Services, Inc.',
 'org': 'AT&T Corp',
 'as': 'AS7018 AT&T Services, Inc.',
 'query': '108.214.245.99',
 'current_weather': {'windspeed': 21.4,
  'time': '2022-04-21T15:00',
  'temperature': 24.1,
  'weathercode': 3,
  'winddirection': 81},
 'generationtime_ms': 0.1239776611328125,
 'elevation': 31.96875,
 'longitude': -81.375,
 'latitude': 28.5,
 'utc_offset_seconds': 0}
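Note that `current_weather` is itself a dictionary, so it would end up as a single column of dictionaries in a table. One option, sketched below in plain Python, is to flatten it into prefixed top-level keys first. `flatten_record` is a hypothetical helper, and the sample record is a trimmed copy of the output above.

```python
def flatten_record(record: dict, nested_key: str = "current_weather") -> dict:
    """Copy record, replacing one nested dictionary with prefixed top-level keys."""
    # Keep every key except the nested dictionary itself
    flat = {key: value for key, value in record.items() if key != nested_key}
    # Promote each nested field, e.g. "temperature" -> "current_weather_temperature"
    for key, value in (record.get(nested_key) or {}).items():
        flat[f"{nested_key}_{key}"] = value
    return flat


record = {
    "ip": "108.214.245.99",
    "city": "Orlando",
    "current_weather": {"windspeed": 21.4, "temperature": 24.1, "weathercode": 3},
}
print(flatten_record(record))
```

pandas also offers `pd.json_normalize` for this kind of flattening, which may be more convenient once the records are collected in a list.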

# Generate Dataset

In [20]:
def get_fake_geo_weather_data():
    fake_ip = faker.ipv4()
    try:
        return get_geo_weather_data(fake_ip)
    except Exception:
        # lookups can fail, e.g. a network error or a response without lat/lon keys
        return None

In [23]:
weather_dicts = [get_fake_geo_weather_data() for i in range(10)]

# remove any missing values
weather_dicts = [x for x in weather_dicts if x is not None]
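Free public APIs commonly limit how many requests a client may send in a given period, so when collecting more than a handful of records it may be wise to space the calls out. The sketch below enforces a minimum interval between calls. `throttled_map` is a hypothetical helper, and the 1.5 second default is an assumed placeholder rather than a documented limit for any of the APIs used here.

```python
import time


def throttled_map(func, items, min_interval_seconds=1.5):
    """Apply func to each item, waiting so calls start at least min_interval_seconds apart."""
    results = []
    last_call = None
    for item in items:
        if last_call is not None:
            elapsed = time.monotonic() - last_call
            if elapsed < min_interval_seconds:
                # Sleep only for the remaining part of the interval
                time.sleep(min_interval_seconds - elapsed)
        last_call = time.monotonic()
        results.append(func(item))
    return results
```

With the names defined earlier, this could be used as `throttled_map(get_geo_weather_data, fake_ips)`.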

In [24]:
df_weather = pd.DataFrame(weather_dicts)

In [25]:
df_weather.head()

| | ip | status | country | countryCode | region | regionName | city | zip | lat | lon | ... | isp | org | as | query | generationtime_ms | elevation | utc_offset_seconds | current_weather | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 110.62.126.130 | success | China | CN | BJ | Beijing | Haidian | | 39.8997 | 116.334 | ... | China TieTong Telecommunications Corporation | North Star Information Hi.tech Ltd. Co | AS9394 China TieTong Telecommunications Corpor... | 110.62.126.130 | 0.187993 | 60.96875 | 0 | {'time': '2022-04-21T15:00', 'windspeed': 19.7... | 39.875 | 116.375 |
| 1 | 160.82.191.104 | success | United States | US | IL | Illinois | Chicago | 60666 | 41.8781 | -87.6298 | ... | Deutsche Bank AG | Deutsche Bank AG | | 160.82.191.104 | 0.123024 | 182.125 | 0 | {'temperature': 12.5, 'winddirection': 261, 't... | 41.875 | -87.625 |
| 2 | 93.81.99.251 | success | Russia | RU | SAR | Saratovskaya Oblast | Saratov | 410000 | 51.5391 | 45.9985 | ... | CORBINA-BROADBAND | | AS8402 PJSC Vimpelcom | 93.81.99.251 | 0.180006 | 36.71875 | 0 | {'windspeed': 14.6, 'weathercode': 2, 'winddir... | 51.5 | 46.0 |
| 3 | 23.160.235.36 | success | United States | US | VA | Virginia | Centreville | 20120 | 38.8397 | -77.4335 | ... | American Registry Internet Numbers | American Registry Internet Numbers | | 23.160.235.36 | 0.128031 | 103.3125 | 0 | {'winddirection': 185, 'windspeed': 17.4, 'tim... | 38.875 | -77.375 |
| 4 | 106.172.240.113 | success | Japan | JP | 40 | Fukuoka | Fukuoka | 810-0001 | 33.5902 | 130.402 | ... | Kddi Corporation | Kddi Corporation | AS2516 KDDI CORPORATION | 106.172.240.113 | 2.39706 | 23.21875 | 0 | {'temperature': 12.8, 'winddirection': 178, 't... | 33.625 | 130.375 |


In [26]:
df_weather["country"].unique()

array(['China', 'United States', 'Russia', 'Japan', 'Brazil',
       'South Korea'], dtype=object)

# Join with Migration Data

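The UN publishes net migration rates as an XLSX file, available from the World Population Prospects download page:

https://population.un.org/wpp/Download/Standard/Migration/

The cell below assumes the file has been downloaded into the same folder as this notebook.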

In [None]:
df_migration = pd.read_excel("WPP2019_MIGR_F01_NET_MIGRATION_RATE.xlsx", skiprows=range(16))

In [None]:
df_migration.head()

In [None]:
df_migration.tail()

In [None]:
# limit to only country-level records
df_migration_countries = df_migration.loc[df_migration["Type"]=="Country/Area"]

df_migration_countries.head()

In [None]:
df_migration_subset = df_migration_countries[["Region, subregion, country or area *", "2015-2020"]]

In [None]:
df_migration_subset.head()

In [None]:
# assign the renamed result instead of using inplace=True on a slice of another dataframe
df_migration_subset = df_migration_subset.rename(columns={"Region, subregion, country or area *": "country", "2015-2020": "migration_rate"})

In [None]:
df_migration_subset.head()

In [None]:
df_weather_migration = df_weather.merge(df_migration_subset, how="left", on="country")

In [None]:
df_weather_migration.head()

In [None]:
# some countries didn't match because names are different

df_weather_migration.loc[df_weather_migration["migration_rate"].isna(), "country"].unique()
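Rather than searching for each unmatched name by hand, the standard library's `difflib` can suggest likely candidates from the other dataset. This is only a sketch: `suggest_country_matches` is a hypothetical helper, the reference list is a made-up subset of names, and the 0.6 cutoff is simply the `difflib` default.

```python
from difflib import get_close_matches


def suggest_country_matches(unmatched, reference_names, cutoff=0.6):
    """Map each unmatched name to up to three similar names from reference_names."""
    suggestions = {}
    for name in unmatched:
        suggestions[name] = get_close_matches(name, reference_names, n=3, cutoff=cutoff)
    return suggestions


# Made-up subset of names from the migration dataset, for illustration
reference = ["United States of America", "Japan", "Brazil", "China"]
print(suggest_country_matches(["United States"], reference))
```

String similarity works best when one name is a close variant of the other. Pairs that are worded quite differently may not be suggested, so a manual check of the remaining names is still worthwhile.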

In [None]:
# a few countries did match up

df_weather_migration.loc[df_weather_migration["migration_rate"].notna(), "country"].unique()

In [None]:
df_migration_subset.loc[df_migration_subset["country"].str.contains("United States")]

In [None]:
df_migration_subset.loc[df_migration_subset["country"].str.contains("Syria")]


In [None]:
df_migration_subset.loc[df_migration_subset["country"].str.contains("Korea")]


In [None]:
df_migration_subset.loc[df_migration_subset["country"].str.contains("Russia")]


In [None]:
df_migration_subset.loc[df_migration_subset["country"].str.contains("Venezuela")]


In [None]:
df_migration_subset.loc[df_migration_subset["country"].str.contains("Viet")]


In [None]:
df_migration_subset.loc[df_migration_subset["country"].str.contains("Taiwan")]


In [None]:
# to get the join to work, let's rename countries in the migration dataset
# assign() returns a new dataframe, which avoids modifying a column copy in place

df_migration_subset = df_migration_subset.assign(
    country=df_migration_subset["country"].replace({
        "United States of America": "United States",
        "Syrian Arab Republic": "Syria",
        "Russian Federation": "Russia",
        "Republic of Korea": "South Korea",
        "Venezuela (Bolivarian Republic of)": "Venezuela",
        "Viet Nam": "Vietnam",
        "China, Taiwan Province of China": "Taiwan",
    })
)


In [None]:
# try the join again
df_weather_migration = df_weather.merge(df_migration_subset, how="left", on="country")

In [None]:
df_weather_migration.head()

In [None]:
# now all of the records should have a match

df_weather_migration.loc[df_weather_migration["migration_rate"].isna(), "country"].unique()