In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import gzip

from urllib.request import urlopen
from xmltodict import parse
from collections import Counter
from zipfile import ZipFile
from io import BytesIO

# Open Data and Data APIs

A large part of doing data science is working with data: cleaning, understanding, filtering, and transforming it. But in order to do that, we need data. Unless you collect your own data, you will need to find interesting data sets that you can understand and ask questions about. Today, we are going to look at possible data sources and their uses.

An [application programming interface (API)](https://en.wikipedia.org/wiki/API) is a data connection between two pieces of software. For our purposes, it is a connection between a data consumer (you) and a data provider. Its primary function is **not** to provide data for human consumption; rather, it is for exchanging data between two computer programs. In short, you'll use an API to fetch the data, not to look at it in its raw form.
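
To make this concrete, here is a minimal sketch of what consuming an API usually looks like in Python: the program receives raw bytes (very often JSON) and immediately turns them into a structure it can work with. The payload below is a made-up stand-in for a server response, and `records_from_json` is a hypothetical helper name, not part of any library.

```python
import json

# A made-up payload standing in for the raw bytes an API would return.
raw = b'{"results": [{"city": "Istanbul", "value": 3}, {"city": "Izmir", "value": 5}]}'

def records_from_json(payload, key="results"):
    """Decode a JSON payload and return the list of records stored under `key`."""
    return json.loads(payload.decode("utf-8"))[key]

records = records_from_json(raw)
total = sum(r["value"] for r in records)
print(records[0]["city"], total)
```

In real use, the bytes would come from something like `urlopen(...).read()` instead of a literal.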


## Toy Datasets for Teaching and Experimenting

Sometimes one needs data to learn a topic or to run small experiments before embarking on *real*, large datasets. A good place for that is [UCI's data repository](https://archive.ics.uci.edu/ml/datasets.php).

Here are a couple of examples:

### The Wine Dataset


The [wine data](https://archive.ics.uci.edu/ml/datasets/wine) consists of chemical analyses of wines grown in the same region of Italy but derived from three different cultivars. The data has 14 columns, 13 of which are measurements:

1. Alcohol
2. Malic acid
3. Ash
4. Alcalinity of ash
5. Magnesium
6. Total phenols
7. Flavanoids
8. Nonflavanoid phenols
9. Proanthocyanins
10. Color intensity
11. Hue
12. OD280/OD315 of diluted wines
13. Proline

The remaining column, which comes first in the file, is the label (1, 2 or 3) indicating the cultivar.
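
Since the file has no header row, the columns come out numbered 0 to 13. A small sketch of how one might attach readable names, assuming the class label is stored in the first column of the file; `label_wine_columns` is a hypothetical helper and the frame below is a dummy used only to show the call.

```python
import pandas as pd

# Names taken from the list above; "Class" is assumed to be the first column.
WINE_COLUMNS = [
    "Class", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium",
    "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins",
    "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline",
]

def label_wine_columns(df):
    """Return a copy of a 14-column wine DataFrame with descriptive column names."""
    if df.shape[1] != len(WINE_COLUMNS):
        raise ValueError(f"expected {len(WINE_COLUMNS)} columns, got {df.shape[1]}")
    out = df.copy()
    out.columns = WINE_COLUMNS
    return out

# Dummy frame with the right shape, only to demonstrate the call.
dummy = pd.DataFrame([[0] * 14, [1] * 14])
named = label_wine_columns(dummy)
print(named.columns.tolist()[:3])
```

With the real data one would call `label_wine_columns(data)` after loading the file.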


In [None]:
with urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data') as url:
    data = pd.read_csv(url, header=None)
    
data

In [None]:
data.describe()

In [None]:
pd.unique(data.iloc[:,0])

In [None]:
plt.hist(data[8],5)

In [None]:
help(pd.DataFrame)

### Mammographic Masses

This [data](https://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/) is about mammography test results on breast tumors. It is in columnar (CSV) format and contains 6 columns:

1. BI-RADS assessment: 1 to 5 (ordinal, non-predictive!)
2. Age: patient's age in years (integer)
3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)

The remaining column indicates whether the tumor is benign (encoded as 0) or malignant (encoded as 1).
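
Files like this one often mark missing measurements with a placeholder character instead of leaving the field empty. Here is a minimal sketch, assuming the placeholder is `?` and using a few illustrative rows in place of the real file. The column names are my own labels for the six columns described above.

```python
from io import StringIO
import pandas as pd

# Illustrative rows in the six-column layout; '?' stands for a missing value.
sample = StringIO("5,67,3,5,3,1\n4,43,1,1,?,1\n5,58,4,5,3,1\n4,28,1,1,3,0\n")

columns = ["BI-RADS", "Age", "Shape", "Margin", "Density", "Severity"]
masses = pd.read_csv(sample, header=None, names=columns, na_values="?")

# Number of missing entries per column.
missing = masses.isna().sum()
print(missing)
```

Without `na_values`, a column containing `?` is read as text, and numeric summaries such as `describe()` silently skip it.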

In [None]:
with urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.data') as url:
    # Missing values are marked with '?' in this file
    data = pd.read_csv(url, header=None, na_values='?')
    
data

In [None]:
data.describe()

In [None]:
Counter(data[5])

In [None]:
plt.hist(data[1],8)

### Bank Marketing Data Set

[The data](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) is about the marketing campaigns of a Portuguese bank, which were conducted by phone. The data is again in columnar format and contains 21 columns:

1. age (numeric)
2. job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3. marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4.  education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5.  default: has credit in default? (categorical: 'no','yes','unknown')
6.  housing: has housing loan? (categorical: 'no','yes','unknown')
7.  loan: has personal loan? (categorical: 'no','yes','unknown')
8.  contact: contact communication type (categorical: 'cellular','telephone')
9.  month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10.  day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11.  duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
12.  campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13.  pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14.  previous: number of contacts performed before this campaign and for this client (numeric)
15.  poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
16.  emp.var.rate: employment variation rate.  quarterly indicator (numeric)
17.  cons.price.idx: consumer price index.  monthly indicator (numeric)
18.  cons.conf.idx: consumer confidence index.  monthly indicator (numeric)
19.  euribor3m: euribor 3 month rate.  daily indicator (numeric)
20.  nr.employed: number of employees.  quarterly indicator (numeric)


The remaining column indicates whether the customer bought the banking service that was advertised/marketed.
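
The note on `duration` has a practical consequence: for a model meant to decide whom to call, that column should be removed before training. Below is a small sketch on a made-up two-row frame that contains only a few of the columns. `realistic_features` is a hypothetical helper, and the target column is assumed to be named `y`.

```python
import pandas as pd

def realistic_features(df, target="y", leak_columns=("duration",)):
    """Split a frame into features and target, dropping columns unknown before the call."""
    features = df.drop(columns=[target, *leak_columns])
    return features, df[target]

# Made-up rows, only to demonstrate the call.
toy = pd.DataFrame({
    "age": [30, 45],
    "duration": [0, 210],
    "campaign": [1, 2],
    "y": ["no", "yes"],
})
X, y = realistic_features(toy)
print(list(X.columns))
```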

In [None]:
with urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip') as url:
    zf = ZipFile(BytesIO(url.read()))
    data = pd.read_csv(zf.open('bank.csv'),sep=';')
    
data

In [None]:
data.describe()

In [None]:
Counter(data['y'])

In [None]:
tabl = pd.crosstab(data['education'],data['default'])
tabl

In [None]:
# Convert counts to row-wise proportions (each education level sums to 1)
tabl = tabl.div(tabl.sum(axis=1), axis=0)
tabl

## Data Science Competitions and Hackathons

[Kaggle](https://kaggle.com) is a data science community site built around data challenges run as competitions. There is a wide variety of competitions that you can browse, try, and even enter. However, the datasets are not open: you must first register at Kaggle before you can look at the data, and you then download it through their web page or their dedicated client. The code below works only if you are already registered and have placed your `kaggle.json` authentication key where the client looks for it (by default, the `.kaggle` folder in your home directory).
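
Before calling the client, it can help to check that the key file is in place. This is a minimal sketch, assuming the client's usual default location `~/.kaggle/kaggle.json` and a file with `username` and `key` fields. `kaggle_key_ok` is a hypothetical helper, not part of the Kaggle package.

```python
import json
from pathlib import Path

def kaggle_key_ok(path=None):
    """Return True if the file exists and has non-empty 'username' and 'key' fields."""
    # Assumed default location of the key file.
    path = Path(path) if path is not None else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return False
    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return False
    return isinstance(creds, dict) and all(creds.get(f) for f in ("username", "key"))

print(kaggle_key_ok())
```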

In [None]:
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()

api.dataset_list()

In [None]:
api.competitions_list()

In [None]:
api.dataset_download_files('kamilpytlak/personal-key-indicators-of-heart-disease')
zf = ZipFile('personal-key-indicators-of-heart-disease.zip')  # 'zf' avoids shadowing the built-in zip
zf.filelist

In [None]:
data = pd.read_csv(zf.open('heart_2020_cleaned.csv'))
zf.close()
data

In [None]:
data.describe()

In [None]:
api.competition_download_files('titanic')
zf = ZipFile('titanic.zip')  # 'zf' avoids shadowing the built-in zip
zf.filelist

In [None]:
data = pd.read_csv(zf.open('train.csv'))
zf.close()
data

In [None]:
# Row-wise proportions: how each outcome (0 or 1) splits across passenger classes
surv = pd.crosstab(data['Survived'], data['Pclass'], normalize='index')

surv

In [None]:
# Column-wise proportions: the share of each outcome within each passenger class
surv = pd.crosstab(data['Survived'], data['Pclass'], normalize='columns')

surv

# Data From Municipalities

We have already used data from the [Istanbul Municipality data service](https://data.ibb.gov.tr/). Many other municipalities also serve open data:

1. [Istanbul Municipality](https://data.ibb.gov.tr/)
2. [Izmir Municipality](https://acikveri.bizizmir.com/)
3. [Bursa Municipality](https://acikyesil.bursa.bel.tr/dataset/)
4. [Athens Open Data](http://geodata.gov.gr/en/dataset)
5. [Barcelona Municipality](https://opendata-ajuntament.barcelona.cat/)
6. [London Data Store](https://data.london.gov.uk/developers/)
7. [New York Open Data](https://opendata.cityofnewyork.us/)
8. [City of Montreal Open Data](https://donnees.montreal.ca/collections)
9. [City of Toronto Open Data](https://open.toronto.ca/)

The best way to explore is to search for "open data api" + your favorite city :)
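
Many municipal portals are built on the CKAN platform, which exposes a JSON search endpoint called `package_search`. Whether a particular portal runs CKAN is something you need to check for yourself. The sketch below simply assumes it for the Istanbul portal, and it only builds the search URL without sending a request. `ckan_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, urljoin

def ckan_search_url(portal, query, rows=5):
    """Build a CKAN `package_search` URL from a portal's base address."""
    params = urlencode({"q": query, "rows": rows})
    return urljoin(portal, "/api/3/action/package_search") + "?" + params

# Assumption: this portal is CKAN-based.
url = ckan_search_url("https://data.ibb.gov.tr/", "dogalgaz")
print(url)
```

If the portal does run CKAN, opening that URL with `urlopen` should return a JSON document listing the matching datasets.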


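Below, we first take a quick look at a local copy of the 2018 Central Park Squirrel Census, and then plot the mean natural gas consumption in Istanbul per year (the plotted values are in millions of m3).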


In [None]:
with open('/home/kaygun/local/tmp/2018_Central_Park_Squirrel_Census_-_Squirrel_Data.csv') as file:
    squirrel = pd.read_csv(file)
    
squirrel

In [None]:
pd.crosstab(squirrel['Tail twitches'],squirrel['Indifferent'])

In [None]:
with urlopen('https://data.ibb.gov.tr/dataset/02bdc2d6-94bb-4e31-816e-528bc9d98703/resource/d5fe41b0-3848-4548-9ac7-6e4756c3027b/download/ilce-baznda-yllara-gore-doalgaz-tuketim-miktar-tr-en.xlsx') as url:
    # Wrap the downloaded bytes in a file-like object before handing them to pandas
    data = pd.read_excel(BytesIO(url.read()))

data

In [None]:
(data.groupby('Yıl')['Dogalgaz Tüketim Miktarı (m3)'].mean()/1e+06).plot()

# Data from Government Organizations

1. [European Central Bank](https://sdw.ecb.europa.eu/)
2. [OECD data](https://data.oecd.org/)
3. [The US Central Bank (FED) data](https://fred.stlouisfed.org/)
4. [The World Bank Data](https://data.worldbank.org/)
5. [The US Government](https://data.gov/developers/apis/index.html) collected all of its open data sources under a single service.
6. [Indian Government Data Portal](https://data.gov.in/)
7. [European Union Data Portal](https://data.europa.eu/en)
8. [Turkish Supreme Election Council](https://acikveri.ysk.gov.tr/anasayfa) (Yüksek Seçim Kurulu) also publishes critical data on all Turkish elections on their data service.
9. [International Monetary Fund (IMF) Data Portal](https://www.imf.org/en/Data)


In [None]:
# Pass the path itself: Excel files are binary, so no text decoding is involved
IMF = pd.read_excel('/home/kaygun/local/tmp/WEOJanuary2022update.xlsx')
    
IMF

In [None]:
with urlopen("http://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml") as conn:
    data = parse(conn.read())['gesmes:Envelope']['Cube']['Cube']['Cube']

rates = [ float(x['@rate']) for x in data ]
currencies = [ x['@currency'] for x in data ]  # currency codes, used as the index

df = pd.DataFrame({'rates': rates}, index=currencies)
df
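
The chain of keys `['gesmes:Envelope']['Cube']['Cube']['Cube']` mirrors the nesting of the XML: an envelope, an outer `Cube`, one `Cube` per date, and one `Cube` per currency. As a sketch, the same structure can be read with the standard library's `ElementTree`. The snippet uses a shortened sample with illustrative rates rather than a live response, and the namespace URIs are assumed to match those in the feed.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the shape of the ECB feed; the rates are illustrative only.
sample = """<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01"
                 xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
  <Cube>
    <Cube time="2022-03-18">
      <Cube currency="USD" rate="1.10"/>
      <Cube currency="TRY" rate="16.30"/>
    </Cube>
  </Cube>
</gesmes:Envelope>"""

# Prefix for the default namespace used by the Cube elements (assumed URI).
NS = {"fx": "http://www.ecb.int/vocabulary/2002-08-01/eurofxref"}

root = ET.fromstring(sample)
day = root.find("fx:Cube/fx:Cube", NS)  # the Cube that carries the date
fx_rates = {c.get("currency"): float(c.get("rate")) for c in day.findall("fx:Cube", NS)}
print(day.get("time"), fx_rates)
```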

In [None]:
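with urlopen('https://api.worldbank.org/v2/en/indicator/SI.POV.DDAY?downloadformat=xml') as url:
    zf = ZipFile(BytesIO(url.read()))  # 'zf' avoids shadowing the built-in zip
    print(zf.filelist)
    # The member name below may differ in a fresh download; compare it with the printed list
    data = pd.read_xml(zf.open('API_SI.POV.DDAY_DS2_en_xml_v2_3732823.xml'))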
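
The member name inside the archive appears to include a version number, so a hard-coded name may stop matching after the data is updated. One possible workaround is to pick the member by its extension. The sketch below demonstrates the idea on a tiny archive built in memory. The member names are invented, and `first_member_with_suffix` is a hypothetical helper.

```python
from io import BytesIO
from zipfile import ZipFile

def first_member_with_suffix(archive, suffix):
    """Return the name of the first archive member whose name ends with `suffix`."""
    for name in archive.namelist():
        if name.lower().endswith(suffix):
            return name
    raise FileNotFoundError(f"no member ending in {suffix!r}")

# Build a tiny archive in memory; the member names are invented for the demo.
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("readme.txt", "notes")
    archive.writestr("API_EXAMPLE_v2_123.xml", "<root/>")

with ZipFile(buffer) as archive:
    member = first_member_with_suffix(archive, ".xml")
    content = archive.read(member)
print(member)
```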

# Geological Survey Data

## US Geological Survey 

[USGS](https://www.usgs.gov/products) has a very large data store where you can get a variety of scientific data, including earthquakes, satellite images, maps, and much more.

## European Space Agency

[European Space Agency](https://open.esa.int/) has an excellent open data service from which you can access a variety of data products such as maps, satellite images and more.

## NASA 

[NASA](https://data.nasa.gov/) also has an open data service.


### An Example from NASA

####  Global Landslide Catalog Export

> The Global Landslide Catalog (GLC) was developed with the goal of identifying rainfall-triggered landslide events around the world, regardless of size, impacts or location. The GLC considers all types of mass movements triggered by rainfall, which have been reported in the media, disaster databases, scientific reports, or other sources. The GLC has been compiled since 2007 at NASA Goddard Space Flight Center. This is a unique data set with the ID tag “GLC” in the landslide editor.

In [None]:
with urlopen('https://data.nasa.gov/api/views/dd9e-wu2v/rows.csv?accessType=DOWNLOAD') as url:
    data = pd.read_csv(url)
    
landslidesTR = data.loc[data['country_code'] == 'TR']
landslidesTR[['event_date','admin_division_name','gazeteer_closest_point','longitude','latitude']]

### An Example from USGS

The data below contains paleohydrologic reconstructions of water-year streamflow for 31 stream gaging sites in the Missouri River Basin, with complete data for 1685 through 1977.

In [None]:
with urlopen('https://www.sciencebase.gov/catalog/file/get/5c994278e4b0b8a7f628903e?f=__disk__3d%2F8b%2F51%2F3d8b512c0dec73102f0c8d7b5c3e8d326dce54fa') as url:
    data = pd.read_csv(url)
    
data

In [None]:
(data.iloc[5,5:]).rolling(window=5).mean().plot(figsize=(12,5))

# Image Classification Data Sets

Here is a small sample of image datasets that can be used for image classification tasks:

1. [MNIST](http://yann.lecun.com/exdb/mnist/) 
2. [Extended MNIST](https://www.nist.gov/itl/products-and-services/emnist-dataset)
3. [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist)
4. [Kuzushiji-MNIST (Japanese MNIST)](https://github.com/rois-codh/kmnist)
5. [CIFAR](https://www.cs.toronto.edu/~kriz/cifar.html)
6. [Olivetti faces data set](https://scikit-learn.org/0.19/datasets/olivetti_faces.html)

In [None]:
with urlopen('http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz') as url:
    fmnist = np.frombuffer(gzip.open(BytesIO(url.read()),'rb').read(), 
                           dtype=np.uint8,
                           offset=16)

fmnist = fmnist.reshape(10000, 784)
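
The `offset=16` above skips the header of the IDX file format: four big-endian 32-bit integers holding a magic number, the number of images, the number of rows, and the number of columns. As a sketch, one could read that header instead of hard-coding 10000 and 784. The example runs on a small synthetic byte string, and `parse_idx_images` is a hypothetical helper.

```python
import struct

def parse_idx_images(raw):
    """Split raw IDX image bytes into (count, rows, cols, pixel_bytes)."""
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:  # 0x00000803: unsigned bytes, three dimensions
        raise ValueError(f"unexpected magic number {magic}")
    pixels = raw[16:]
    if len(pixels) != count * rows * cols:
        raise ValueError("pixel data does not match the header")
    return count, rows, cols, pixels

# Synthetic file: two 2x3 "images" instead of 10000 images of 28x28.
fake = struct.pack(">IIII", 2051, 2, 2, 3) + bytes(range(12))
count, rows, cols, pixels = parse_idx_images(fake)
print(count, rows, cols, len(pixels))
```

With NumPy, the pixel bytes can then be shaped with `np.frombuffer(pixels, dtype=np.uint8).reshape(count, rows * cols)`.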

In [None]:
N = np.random.randint(10000)
plt.imshow(fmnist[N].reshape((28,28)))

# Satellite Image Data

1. [Hyperspectral Remote Sensing Scenes](http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes)
2. [A large list of open GIS Image data sets](https://freegisdata.rtwilson.com/)

# APIs where you would have to register and login

Apart from Kaggle, the data sources I listed above are open: you don't need to enter credentials to log in and access the data. However, most commercial data vendors do ask you to register and log in before you access their data.

## NASDAQ Financial Data

Here is an example: [NASDAQ](https://data.nasdaq.com/). Nasdaq is the world's first electronic exchange platform for buying and trading securities, and it has an extensive collection of market data. To use it, you need their Python library, and you must register at their site (you'll need an API key).

In [None]:
!pip install nasdaq-data-link

In [None]:
import nasdaqdatalink
nasdaqdatalink.read_key("/home/kaygun/.config/apikey")

In [None]:
data = nasdaqdatalink.get('CHRIS/CME_W1')
data

In [None]:
data['High'].plot(figsize=(18,6))

## Twitter Data

You can fetch data from [Twitter](https://twitter.com) through their [API](https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api) and analyze it with Python libraries. [Tweepy](https://www.tweepy.org/) is a popular choice, but you should definitely look around and find one that suits your needs.

Here is how I would install and import tweepy. Remember that `!pip install tweepy` needs to be run only once: after it is installed, you don't have to run that part again on your machine. But you must run `import tweepy` every time you are going to use it.

In [None]:
#!pip install tweepy
import tweepy as tw

I put all the necessary login details for the API into a single JSON file in my own directory. When you register at Twitter, replace the 'apikey' file with your own, using the following structure:

    {"API_key": "your api key",
     "API_secret_key": "your api secret key",
     "access_token": "your access token",
     "access_secret": "your access token secret"}

In [None]:
import json  # needed for json.load below

with open('/home/kaygun/.config/twitter/apikey') as file:
    keys = json.load(file)
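
A missing or misspelled field in the key file only shows up later as an authentication error, so it may be worth checking the file right after loading it. Here is a minimal sketch that works on the JSON text directly. `load_twitter_keys` is a hypothetical helper and the values in the example are placeholders.

```python
import json

# Field names from the structure shown above.
REQUIRED = ("API_key", "API_secret_key", "access_token", "access_secret")

def load_twitter_keys(text):
    """Parse the JSON text of a key file and check that all four fields are present."""
    parsed = json.loads(text)
    missing = [name for name in REQUIRED if not parsed.get(name)]
    if missing:
        raise KeyError(f"missing fields: {', '.join(missing)}")
    return parsed

# Placeholder values, not real credentials.
example = '{"API_key": "a", "API_secret_key": "b", "access_token": "c", "access_secret": "d"}'
keys_demo = load_twitter_keys(example)
print(sorted(keys_demo))
```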

This is how you would login:

In [None]:
auth = tw.OAuth1UserHandler(
   keys["API_key"], keys["API_secret_key"], keys["access_token"], keys["access_secret"]
)

api = tw.API(auth)

Let us collect some tweets from this month:

In [None]:
tweets = api.search_tweets(q='William Hurt', lang='en', count=100)

and display 10 of them:


In [None]:
[x.text for x in tweets[:10]]

In [None]:
res = api.search_users('Atabey_Kaygun')

In [None]:
res[0]