# Women Who Code :: Build-Your-Own Dataset

Sometimes data scientists are handed a fully prepared and cleaned dataset, but this is rarely the case. Today's workshop will give you practice in building your own dataset from scratch. We will use public APIs and publicly available data files to create a dataset of weather and population data that is ready for downstream uses.

In this workshop, we'll be collecting and organizing information for fictional visitors to our fictional 


In [1]:
from faker import Faker
import pandas as pd
import requests

# Part 1 :: Calling Public APIs

In this first section, we will use several publicly available APIs to collect information about fictional visitors to our website. The only information we directly collect about visitors is their IP address. Beyond that, we'll have to pull in information from outside sources to learn more about our visitors.

## Get your IP address

An IP address is a unique address that identifies a device on the internet or a local network. IP stands for "Internet Protocol," which is the set of rules governing the format of data sent via the internet or local network.

To find out your own IP address, you can make a call to the [ipify](https://www.ipify.org/) API, a simple public IP address API. This API does not require an account or API key.

In [2]:
# api endpoint
url = "https://api.ipify.org"

# request formatted response
params = {
    "format": "json"
}

resp = requests.get(url, params)

Congrats, you just made a call to the first API of this workshop! Let's take a closer look at the response, and see what information we've collected from it.

In [3]:
# http response status codes indicate whether the request has been successfully completed
resp.status_code

200
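Status codes are grouped into classes by their first digit: 2xx means success, 4xx means the client made a bad request, and 5xx means the server failed. As a minimal sketch using only Python's standard library, the helper below (the name `describe_status` is made up for illustration) turns a numeric code into a readable description.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short description such as '200 OK (success)'."""
    # the first digit of a status code identifies its class
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # HTTPStatus raises ValueError for codes it does not know
    phrase = HTTPStatus(code).phrase
    return f"{code} {phrase} ({classes[code // 100]})"


print(describe_status(200))
print(describe_status(404))
print(describe_status(429))
```

With `requests`, `resp.ok` is `True` for codes below 400, and `resp.raise_for_status()` raises an exception for 4xx and 5xx responses.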

In [4]:
# can get response in a string format
resp.text

'{"ip":"69.136.161.247"}'

In [5]:
# or more usefully as a json dictionary
resp.json()

{'ip': '69.136.161.247'}

In [6]:
# let's hold on to ip address, and use it in some next steps
ip = resp.json()["ip"]

## Get location for an IP address

IP addresses can be linked to information about the location where you are connected to the internet.

To find geolocation information given an IP address, we can use the [ip-api](https://ip-api.com/) JSON endpoint. The IP address endpoint allowed us to pass the desired response format (JSON) as a query parameter, but this API has a specific JSON endpoint, so we'll specify the data format as part of the URL.

In [7]:
url = f"http://ip-api.com/json/{ip}"

resp = requests.get(url)
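To make the difference between the two URL styles concrete, here is a small sketch that builds both URLs by hand using the standard library. The helper `with_query` is hypothetical and only approximates what `requests` does when it receives a `params` dictionary.

```python
from urllib.parse import urlencode


def with_query(base_url: str, params: dict) -> str:
    """Append URL-encoded query parameters to a base URL."""
    return f"{base_url}?{urlencode(params)}"


# style 1: the response format is passed as a query parameter (ipify)
ipify_url = with_query("https://api.ipify.org", {"format": "json"})

# style 2: the response format and the IP address are part of the path (ip-api)
ip = "69.136.161.247"
ip_api_url = f"http://ip-api.com/json/{ip}"

print(ipify_url)
print(ip_api_url)
```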

In [8]:
# check that the call succeeded
resp.status_code

200

In [9]:
# inspect the returned data
resp.json()

{'status': 'success',
 'country': 'United States',
 'countryCode': 'US',
 'region': 'IL',
 'regionName': 'Illinois',
 'city': 'Chicago',
 'zip': '60610',
 'lat': 41.9032,
 'lon': -87.6383,
 'timezone': 'America/Chicago',
 'isp': 'Comcast Cable Communications, LLC',
 'org': 'Comcast Cable Communications, Inc.',
 'as': 'AS7922 Comcast Cable Communications, LLC',
 'query': '69.136.161.247'}

In [10]:
# pull out the lat/long fields, since we can look up info about this location

lat = resp.json()["lat"]
long = resp.json()["lon"]

In [11]:
def get_location_info(ip: str):
    """
    Given an IP address, return a dictionary of location information.
    """
    url = f"http://ip-api.com/json/{ip}"
    resp = requests.get(url)
    return resp.json()
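The response body carries its own `status` field, so it can be worth checking that field before reading `lat` and `lon`. The sketch below works on plain dictionaries and needs no network access. The helper name `extract_coordinates` and the shape of the failed payload are assumptions for illustration, not something shown earlier in this workshop.

```python
from typing import Optional, Tuple


def extract_coordinates(location_info: dict) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) if the lookup succeeded, otherwise None."""
    if location_info.get("status") != "success":
        return None
    lat = location_info.get("lat")
    lon = location_info.get("lon")
    if lat is None or lon is None:
        return None
    return lat, lon


# trimmed copy of the successful response shown above
ok = {"status": "success", "city": "Chicago", "lat": 41.9032, "lon": -87.6383}

# assumed shape of a failed lookup (for example, a private IP address)
failed = {"status": "fail", "message": "private range", "query": "192.168.0.1"}

print(extract_coordinates(ok))
print(extract_coordinates(failed))
```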

## Get the local weather

Now that we know where a visitor is from, we can collect any information from other sources to understand the user's location. For this workshop, let's say that our website has content related to weather, and so the visitor's current weather is of interest to us.

Given the visitor's latitude and longitude, we can use the [Open Meteo](https://open-meteo.com/en) API to get information about the location's current weather. Like the other APIs we've used in this workshop, Open Meteo is public and does not require an API key.



In [12]:
url = "https://api.open-meteo.com/v1/forecast"

params = {
    "latitude": lat,
    "longitude": long,
    "current_weather": True,
    "format": "json"
}

resp = requests.get(url, params)

In [13]:
resp.status_code

200

In [14]:
resp.json()

{'elevation': 182.125,
 'latitude': 41.875,
 'current_weather': {'winddirection': 288,
  'windspeed': 15.6,
  'weathercode': 1,
  'temperature': 2.9,
  'time': '2022-04-26T13:00'},
 'generationtime_ms': 0.14591217041015625,
 'longitude': -87.625,
 'utc_offset_seconds': 0}

In [15]:
def get_weather_info(lat: float, long: float):
    """
    Given a latitude and longitude, return the current weather.
    """
    url = "https://api.open-meteo.com/v1/forecast"

    params = {
        "latitude": lat,
        "longitude": long,
        "current_weather": True,
        "format": "json"
    }

    resp = requests.get(url, params)
    return resp.json()
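The values we care about most (temperature, wind speed, and so on) are nested under the `current_weather` key. One possible way to make them easier to use later is to lift them to the top level. This is a sketch only; the helper name `flatten_weather` and the `current_` prefix are choices made here for illustration.

```python
def flatten_weather(weather_info: dict) -> dict:
    """Lift the nested `current_weather` fields to the top level."""
    # keep every top-level field except the nested dictionary
    flat = {k: v for k, v in weather_info.items() if k != "current_weather"}
    # prefix nested keys so they cannot collide with top-level keys
    for key, value in weather_info.get("current_weather", {}).items():
        flat[f"current_{key}"] = value
    return flat


# the response shown above
sample = {
    "elevation": 182.125,
    "latitude": 41.875,
    "current_weather": {
        "winddirection": 288,
        "windspeed": 15.6,
        "weathercode": 1,
        "temperature": 2.9,
        "time": "2022-04-26T13:00",
    },
    "generationtime_ms": 0.14591217041015625,
    "longitude": -87.625,
    "utc_offset_seconds": 0,
}

flat = flatten_weather(sample)
print(flat["current_temperature"], flat["current_windspeed"])
```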


# Part 2 :: Prepare Your Dataset

## 2.1 :: Generate fake IP addresses

Since we are working with fake data, we'll have to create some fake IP addresses for our website visitors. To do this, we'll use a package called [Faker](https://faker.readthedocs.io/en/master/), which generates fake data. It can generate all types of fake data, ranging from addresses to names to, wouldn't you know it, IP addresses!

In [20]:
# faker generator
faker = Faker()

# create a fake ip address
faker.ipv4()

'172.207.40.57'
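Randomly generated addresses may fall in private or reserved ranges, which a geolocation lookup generally cannot resolve. As a sketch, Python's built-in `ipaddress` module can tell us whether an address is globally routable before we spend an API call on it. The helper name `is_public_ipv4` is hypothetical.

```python
import ipaddress


def is_public_ipv4(ip: str) -> bool:
    """Return True if `ip` is a valid, globally routable IPv4 address."""
    try:
        return ipaddress.IPv4Address(ip).is_global
    except ipaddress.AddressValueError:
        # not a valid IPv4 address at all
        return False


for candidate in ["8.8.8.8", "192.168.0.1", "10.0.0.1", "127.0.0.1", "not-an-ip"]:
    print(candidate, is_public_ipv4(candidate))
```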

## 2.2 :: Generate weather data

For each visitor IP address, we'll want to run our full weather collection process of getting location based on IP, then weather based on location. One way to do this is to define a function that takes in a (fake) IP address, hits the IP-to-location API, then sends this response to the Location-to-weather API.

In [39]:
def get_geo_weather_data(ip: str):
    
    # get location info for the ip address
    location_info = get_location_info(ip)
    
    # get the current weather at the lat/long
    weather_info = get_weather_info(location_info["lat"], location_info["lon"])
    
    # stack the dictionaries
    # only works in python 3.9 and above
    all_data = {"ip": ip} | location_info | weather_info
    
    # alternative for lower python versions using unpacking
    # all_data = {"ip": ip, **location_info, **weather_info}
    
    return all_data
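When dictionaries are combined this way, keys are kept in the order they first appear, and if the same key occurs more than once, the right-most value wins. The small sketch below uses unpacking (which also works on Python versions before 3.9) with trimmed sample values from earlier in the workshop.

```python
location_info = {"country": "United States", "lat": 41.9032, "lon": -87.6383}
weather_info = {"latitude": 41.875, "longitude": -87.625, "elevation": 182.125}

# combine everything into one flat record for a single visitor
merged = {"ip": "69.136.161.247", **location_info, **weather_info}
print(list(merged))

# when a key appears more than once, the right-most value wins
override = {**{"status": "success"}, **{"status": "fail"}}
print(override)
```

In our case the two APIs use different key names (`lat`/`lon` versus `latitude`/`longitude`), so nothing is overwritten.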

Organizing all of the API calls into a single function allows us to write a simple function that:

1. Makes a fake IP address
2. Gets the location and weather data for that IP address
3. Handles the case where we don't get back valid weather data (for example, an API returned an error)

In [44]:
def get_fake_geo_weather_data(max_retries=5):
    # keep trying again until we either get a valid result, or hit the max number of retries
    retries = 0
    while retries <= max_retries:
        fake_ip = faker.ipv4()
        # we won't always get successful results from each IP
        try:
            return get_geo_weather_data(fake_ip)
        # for now, we can skip any failed attempts
        # (catch Exception rather than using a bare except, so Ctrl-C still interrupts)
        except Exception:
            retries += 1
    print("Max retries reached!")
    return None
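The same retry pattern can be sketched in a form that runs offline, which makes it easier to see how it behaves. Everything here is illustrative: `retry_with_new_input` is a hypothetical helper, and `fake_fetch` stands in for `get_geo_weather_data` with placeholder coordinates.

```python
from typing import Callable, Optional


def retry_with_new_input(make_input: Callable[[], str],
                         fetch: Callable[[str], dict],
                         max_attempts: int = 5) -> Optional[dict]:
    """Call fetch(make_input()) until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        candidate = make_input()
        try:
            return fetch(candidate)
        except (KeyError, ValueError) as err:
            # only swallow the errors we expect; anything else still surfaces
            print(f"attempt {attempt} failed for {candidate}: {err!r}")
    return None


# stand-ins for faker.ipv4 and get_geo_weather_data so the sketch runs offline
fake_ips = iter(["10.0.0.1", "192.168.0.1", "8.8.8.8"])


def fake_fetch(ip: str) -> dict:
    if ip != "8.8.8.8":
        # mimics a failed lookup that has no "lat" field
        raise KeyError("lat")
    return {"ip": ip, "lat": 0.0, "lon": 0.0}  # placeholder values


result = retry_with_new_input(lambda: next(fake_ips), fake_fetch)
print(result)
```

With real API calls, network errors such as `requests.exceptions.RequestException` would likely belong in the `except` tuple as well.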
        

To handle potential API errors, we allowed our function to return a value of `None` in cases where no valid data was returned after the maximum number of retries. To clean up the data and make it easier for analysis, we can drop these failed attempts from our list of weather data responses.

In [48]:
weather_dicts = [get_fake_geo_weather_data() for i in range(10)]

# remove any missing values, in case max retries was hit at any point
weather_dicts = [x for x in weather_dicts if x is not None]

Pandas dataframes are a standard across many data science teams, and so we will convert this list of dicts to a DataFrame for downstream analysis and data validation.

In [49]:
# list of dicts to pandas dataframe is easy!
df_weather = pd.DataFrame(weather_dicts)

In [50]:
# take a peek to make sure the data looks as we'd expect it to
df_weather.head()

Unnamed: 0,ip,status,country,countryCode,region,regionName,city,zip,lat,lon,...,isp,org,as,query,longitude,utc_offset_seconds,generationtime_ms,elevation,current_weather,latitude
0,8.95.7.7,success,United States,US,LA,Louisiana,Monroe,71203.0,32.5896,-92.0669,...,"Level 3 Communications, Inc.","Level 3, LLC","AS3356 Level 3 Parent, LLC",8.95.7.7,-92.125,0,2.327919,23.09375,"{'time': '2022-04-26T18:00', 'weathercode': 1,...",32.625
1,164.104.181.14,success,United States,US,CO,Colorado,Fort Collins,80525.0,40.5377,-105.0546,...,Poudre School District R-1,Poudre School District R,AS54060 Poudre School District R-1,164.104.181.14,-105.0,0,0.120997,1504.0,"{'temperature': 16.8, 'winddirection': 157, 't...",40.5
2,56.100.209.164,success,United States,US,IL,Illinois,Chicago,60666.0,41.8781,-87.6298,...,United States Postal Service.,United States Postal Service,,56.100.209.164,-87.625,0,0.123978,182.125,"{'weathercode': 2, 'windspeed': 17.3, 'winddir...",41.875
3,155.207.132.190,success,Greece,GR,B,Central Macedonia,Thessaloniki,,40.6381,22.9455,...,Aristotle University of Thessaloniki,AUTH,AS5470 Aristotle University of Thessaloniki,155.207.132.190,22.9375,0,0.186086,-999.0,"{'winddirection': 235, 'temperature': 19.3, 't...",40.625
4,1.60.174.157,success,China,CN,HL,Heilongjiang,Harbin,,45.8038,126.535,...,China Unicom Heilongjiang Province Network,,AS4837 CHINA UNICOM China169 Backbone,1.60.174.157,126.5,0,2.333045,133.375,"{'weathercode': 0, 'time': '2022-04-26T18:00',...",45.75


In [52]:
# pandas made it easy to inspect our data, such as seeing the set of countries we collected data from
df_weather["country"].unique()

array(['United States', 'Greece', 'China', 'Canada', 'Netherlands',
       'Germany', 'Japan', 'South Korea'], dtype=object)

And there you have it! At this point, we have used several public APIs to collect location and weather data about imaginary visitors to our company's website. We've organized this data into a Pandas DataFrame format, which will make it easy to combine with additional data, and to use for downstream analysis or modeling applications.

## 2.3 :: Join with Migration Data

We learned a lot about our individual visitors by starting from their IP addresses and calling other APIs to collect supplemental information.

Oftentimes, relevant data exists in a database or table format. For example, consider the case where our website offers relocation services, such as a moving company or a service that helps individuals find job opportunities in new countries. For a use case like this, it could be valuable to learn about the typical migration rates in and out of the countries in which our visitors reside.

Luckily for us, the United Nations publicly publishes migration rates at the country level, and we can download this data for free. After accessing this data, we can join it to our visitors data table using the country name as a key.

XLSX file available from the UN:

https://population.un.org/wpp/Download/Standard/Migration/

In case the location of this file changes, we've also attached a copy of it to this repo.

In [54]:
# skip the first few rows, which just contain extra header information
df_migration = pd.read_excel("https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/EXCEL_FILES/4_Migration/WPP2019_MIGR_F01_NET_MIGRATION_RATE.xlsx", skiprows=range(16))

In [55]:
df_migration.head()

Unnamed: 0,Index,Variant,"Region, subregion, country or area *",Notes,Country code,Type,Parent code,1950-1955,1955-1960,1960-1965,...,1970-1975,1975-1980,1980-1985,1985-1990,1990-1995,1995-2000,2000-2005,2005-2010,2010-2015,2015-2020
0,1,Estimates,WORLD,,900,World,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Estimates,UN development groups,a,1803,Label/Separator,900,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2,3,Estimates,More developed regions,b,901,Development Group,1803,0.031,0.032,0.533,...,1.269,1.235,1.062,1.235,1.838,2.228,2.745,2.773,2.332,2.215
3,4,Estimates,Less developed regions,c,902,Development Group,1803,-0.014,-0.014,-0.224,...,-0.456,-0.411,-0.327,-0.352,-0.486,-0.551,-0.64,-0.613,-0.491,-0.443
4,5,Estimates,Least developed countries,d,941,Development Group,902,-0.487,-0.567,-0.787,...,-2.699,-2.118,-2.927,-1.573,0.321,-1.389,-1.254,-2.42,-1.716,-0.973


In [56]:
df_migration["Type"].unique()

array(['World', 'Label/Separator', 'Development Group', 'Special other',
       'Income Group', 'Region', 'SDG region', 'Subregion',
       'Country/Area', 'SDG subregion'], dtype=object)

Looking at the data as it's read in, you can make several observations:

- Data is reported at various levels of aggregation, such as country, income group, and overall (world).
- Metrics are reported for multiple five-year ranges. While the full history is interesting, we will be most interested in the most recent range (2015-2020).

Because of our specific interests, let's limit rows to just those reporting on country-level values, and limit columns to just the region name and most recent measurement.

In [57]:
# limit to only country-level rows
df_migration_countries = df_migration.loc[df_migration["Type"]=="Country/Area"]
df_migration_countries.head()

Unnamed: 0,Index,Variant,"Region, subregion, country or area *",Notes,Country code,Type,Parent code,1950-1955,1955-1960,1960-1965,...,1970-1975,1975-1980,1980-1985,1985-1990,1990-1995,1995-2000,2000-2005,2005-2010,2010-2015,2015-2020
26,27,Estimates,Burundi,,108,Country/Area,910,-5.773,-5.252,-5.769,...,-14.847,-7.65,-6.248,-7.066,-11.203,-14.728,-0.719,0.748,-1.487,0.181
27,28,Estimates,Comoros,,174,Country/Area,910,0.0,-6.666,-8.522,...,-4.514,7.078,-2.714,-2.347,-1.353,-2.358,-3.466,-3.074,-2.726,-2.429
28,29,Estimates,Djibouti,,262,Country/Area,910,3.037,13.052,36.254,...,36.47,62.553,5.098,35.432,-14.745,2.964,-2.398,-3.011,1.369,0.947
29,30,Estimates,Eritrea,,232,Country/Area,910,0.232,0.629,1.298,...,1.42,1.235,1.07,-3.754,-28.212,-11.564,17.76,-5.337,-15.108,-11.571
30,31,Estimates,Ethiopia,,231,Country/Area,910,-0.21,-0.19,-0.17,...,-0.391,-12.479,1.312,3.461,5.557,-0.505,-0.421,-0.122,0.849,0.278


In [58]:
# limit to only relevant columns
df_migration_subset = df_migration_countries[["Region, subregion, country or area *", "2015-2020"]]

In [59]:
df_migration_subset.head()

Unnamed: 0,"Region, subregion, country or area *",2015-2020
26,Burundi,0.181
27,Comoros,-2.429
28,Djibouti,0.947
29,Eritrea,-11.571
30,Ethiopia,0.278


Right now, the column names aren't very specific to our limited use case. So, we can rename the columns in our subsetted dataframe to be more interpretable in our downstream dataset. 

In [60]:
# reassign the result instead of using inplace=True,
# which triggers a SettingWithCopyWarning on a subset of another DataFrame
df_migration_subset = df_migration_subset.rename(columns={"Region, subregion, country or area *": "country", "2015-2020": "migration_rate"})


In [61]:
df_migration_subset.head()

Unnamed: 0,country,migration_rate
26,Burundi,0.181
27,Comoros,-2.429
28,Djibouti,0.947
29,Eritrea,-11.571
30,Ethiopia,0.278


At this point, we have a cleaned up DataFrame with weather data (at the visitor-level), and a cleaned up DataFrame with migration data (at the country-level). To be able to look at these metrics together, we will join the data together. Because our ultimate goal is to have all data at the website visitor level, we will want to perform a left join of the migration data to the weather data, as the migration data is aggregated at a coarser level.

In [64]:
df_weather_migration = df_weather.merge(df_migration_subset, how="left", left_on="country", right_on="country")

And, voila!

In [65]:
df_weather_migration.head()

Unnamed: 0,ip,status,country,countryCode,region,regionName,city,zip,lat,lon,...,org,as,query,longitude,utc_offset_seconds,generationtime_ms,elevation,current_weather,latitude,migration_rate
0,8.95.7.7,success,United States,US,LA,Louisiana,Monroe,71203.0,32.5896,-92.0669,...,"Level 3, LLC","AS3356 Level 3 Parent, LLC",8.95.7.7,-92.125,0,2.327919,23.09375,"{'time': '2022-04-26T18:00', 'weathercode': 1,...",32.625,
1,164.104.181.14,success,United States,US,CO,Colorado,Fort Collins,80525.0,40.5377,-105.0546,...,Poudre School District R,AS54060 Poudre School District R-1,164.104.181.14,-105.0,0,0.120997,1504.0,"{'temperature': 16.8, 'winddirection': 157, 't...",40.5,
2,56.100.209.164,success,United States,US,IL,Illinois,Chicago,60666.0,41.8781,-87.6298,...,United States Postal Service,,56.100.209.164,-87.625,0,0.123978,182.125,"{'weathercode': 2, 'windspeed': 17.3, 'winddir...",41.875,
3,155.207.132.190,success,Greece,GR,B,Central Macedonia,Thessaloniki,,40.6381,22.9455,...,AUTH,AS5470 Aristotle University of Thessaloniki,155.207.132.190,22.9375,0,0.186086,-999.0,"{'winddirection': 235, 'temperature': 19.3, 't...",40.625,-1.518
4,1.60.174.157,success,China,CN,HL,Heilongjiang,Harbin,,45.8038,126.535,...,,AS4837 CHINA UNICOM China169 Backbone,1.60.174.157,126.5,0,2.333045,133.375,"{'weathercode': 0, 'time': '2022-04-26T18:00',...",45.75,-0.245


## 2.4 :: Data Validation and Cleaning

So far things are looking pretty good, but let's dig a little deeper to see how things turned out after the join. One thing to be cautious about is that a left join keeps every row from the left table, even when there is no match in the right table. For example, if a particular visitor's country doesn't have an exact match in the migration dataset, the visitor will remain a row in our DataFrame, but the migration rate will be left empty (`NaN`)!

In [67]:
# a few countries did match up
df_weather_migration.loc[df_weather_migration["migration_rate"].notna(), "country"].unique()

array(['Greece', 'China', 'Canada', 'Netherlands', 'Germany', 'Japan'],
      dtype=object)

In [68]:
# some countries didn't match because names are different
df_weather_migration.loc[df_weather_migration["migration_rate"].isna(), "country"].unique()

array(['United States', 'South Korea'], dtype=object)

The cell above lists the countries where the migration rate is empty, which are the countries that don't have an exact string match in the migration data. To help troubleshoot this, we can search the migration data for entries that contain at least a partial string match.

In [69]:
# search for strings containing `United States`
# this shows that the migration data refers to this country as `United States of America`
# we can clean this up prior to the join, and then they should match up
df_migration_subset.loc[df_migration_subset["country"].str.contains("United States")]

Unnamed: 0,country,migration_rate
164,United States Virgin Islands,-4.306
254,United States of America,2.929
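Searching one name at a time works for a handful of countries. As a rough sketch for checking several unmatched names at once, the hypothetical helper below lists reference names that share at least one word with each unmatched name. It uses plain Python lists rather than the DataFrame, and the reference list is a small sample of names from the migration data.

```python
def candidate_matches(name: str, reference_names: list) -> list:
    """Return reference names sharing at least one word with `name` (case-insensitive)."""
    words = {w.lower() for w in name.split()}
    return [
        ref for ref in reference_names
        if words & {w.lower() for w in ref.split()}
    ]


# small sample of country names as they appear in the migration data
reference = [
    "United States Virgin Islands",
    "United States of America",
    "Republic of Korea",
    "Greece",
    "China",
]

for unmatched in ["United States", "South Korea"]:
    print(unmatched, "->", candidate_matches(unmatched, reference))
```

The candidates still need a human decision: only one of the two `United States` entries is the right match.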


In [70]:
# to get join to work, let's rename country in the migration dataset
# this dict of replacements came from running a number of IPs through our process
# it may not be exhaustive
country_replacements = {
    "United States of America": "United States",
    "Syrian Arab Republic": "Syria",
    "Russian Federation": "Russia",
    "Republic of Korea": "South Korea",
    "Venezuela (Bolivarian Republic of)": "Venezuela",
    "Viet Nam": "Vietnam",
    "China, Taiwan Province of China": "Taiwan",
}

# assign the result to a new DataFrame rather than replacing in place,
# which triggers a SettingWithCopyWarning on a subset of another DataFrame
df_migration_subset = df_migration_subset.assign(
    country=df_migration_subset["country"].replace(country_replacements)
)


In [71]:
# try the join again
df_weather_migration = df_weather.merge(df_migration_subset, how="left", left_on="country", right_on="country")

In [73]:
# now see if all entries have a match
# empty array means that no entries are missing migration data
df_weather_migration.loc[df_weather_migration["migration_rate"].isna(), "country"].unique()

array([], dtype=object)
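Another way to audit a join, sketched here on a tiny stand-in for our two tables, is to let pandas do the bookkeeping. The `indicator=True` option of `merge` adds a `_merge` column recording whether each row found a match, and `validate="many_to_one"` raises an error if a country appears more than once in the right-hand table. The `visitors` and `migration` frames below are small samples built from values shown earlier.

```python
import pandas as pd

visitors = pd.DataFrame({
    "ip": ["8.95.7.7", "155.207.132.190", "1.60.174.157"],
    "country": ["United States", "Greece", "China"],
})
migration = pd.DataFrame({
    "country": ["United States of America", "Greece", "China"],
    "migration_rate": [2.929, -1.518, -0.245],
})

merged = visitors.merge(
    migration,
    how="left",
    on="country",
    validate="many_to_one",  # each country may appear only once on the right
    indicator=True,          # adds a `_merge` column: "both" or "left_only"
)

# rows that exist only on the left found no migration data
unmatched = merged.loc[merged["_merge"] == "left_only", "country"].unique().tolist()
print(unmatched)
```

A left join with `many_to_one` validation should never change the number of visitor rows, which is a quick sanity check after any join.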

We've now created a dataset containing location, weather, and migration data for visitors to our website. Depending on your use case, you may decide to add additional data sources, perform feature engineering, or implement extra cleanup and data validation steps.

# Part 3 :: Data visualization and analysis