<table align="center">
   <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ds5110/summer-2021/blob/master/03b-DataLoading.ipynb">
<img src="https://github.com/ds5110/summer-2021/raw/master/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>

# 3b -- Data Loading

### Reading/References

* [Python for Data Analysis, 2nd Ed](https://github.com/wesm/pydata-book) (McKinney 2017) -- github
  * [ch05.ipynb](https://github.com/wesm/pydata-book/blob/2nd-edition/ch05.ipynb) getting started with pandas -- github
  * [ch06.ipynb](https://github.com/wesm/pydata-book/blob/2nd-edition/ch06.ipynb) data loading & storage -- github
  * [ch07.ipynb](https://github.com/wesm/pydata-book/blob/2nd-edition/ch07.ipynb) cleaning and preparation -- github


In [None]:
# McKinney setup -- standard practice for a pro
# import numpy as np
# import pandas as pd
# PREVIOUS_MAX_ROWS = pd.options.display.max_rows
# pd.options.display.max_rows = 20
# np.random.seed(12345)
# import matplotlib.pyplot as plt
# plt.rc('figure', figsize=(10, 6))
# np.set_printoptions(precision=4, suppress=True)

In [None]:
import pandas as pd
import numpy as np

# Loading data with Pandas

Pandas provides a family of `read_*` functions, one per file format or data source. They include:

* `read_csv` -- load delimited data from a file, URL, or file-like object; use comma as default delimiter
* `read_fwf` -- data in fixed-width column format (i.e., no delimiters)
* `read_clipboard` -- version of read_csv that reads data from the clipboard; useful for converting tables from web pages
* `read_excel` -- tabular data from an Excel XLS or XLSX file
* `read_hdf` -- HDF5 files written by pandas
* `read_html` -- read all tables found in the given HTML document
* `read_json` -- data from a JSON (JavaScript Object Notation) string representation
* `read_msgpack` -- pandas data encoded using the MessagePack binary format (deprecated in pandas 0.25 and removed in 1.0)
* `read_pickle` -- an arbitrary object stored in Python pickle format
* `read_sas` -- a SAS dataset stored in one of the SAS system’s custom storage formats
* `read_sql` -- the results of a SQL query (using SQLAlchemy) as a pandas DataFrame
* `read_stata` -- a dataset from Stata file format
* `read_feather` -- the Feather binary file format
  * https://wesmckinney.com/pages/about.html -- Feb 2020
  * https://wesmckinney.com/blog/feather-arrow-future/ -- Oct 2017
  * https://wesmckinney.com/blog/apache-arrow-pandas-internals/ -- Sep 2017
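
As a minimal sketch of how the first two readers in the list differ, the example below parses the same small table twice: once as comma-delimited text with `read_csv` and once as fixed-width text with `read_fwf`. The inline strings are made up for illustration, and `io.StringIO` stands in for a file or URL.

```python
import io
import pandas as pd

# Hypothetical data: the same table in two text layouts
csv_text = "name,score\nada,90\nbob,85\n"
fwf_text = "name  score\nada      90\nbob      85\n"

# read_csv splits each line on the delimiter (comma by default)
df_csv = pd.read_csv(io.StringIO(csv_text))

# read_fwf infers column boundaries from the whitespace layout
df_fwf = pd.read_fwf(io.StringIO(fwf_text))

print(df_csv)
print(df_fwf)
```

Both calls accept any file-like object, so the same code should work with an open file handle.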

# Loading a CSV from GitHub

Navigate to the file of interest and copy the "Raw" URL.

* https://github.com/wesm/pydata-book/tree/2nd-edition/examples

In [None]:
url = "https://github.com/wesm/pydata-book/raw/2nd-edition/examples/ex1.csv"

df = pd.read_csv(url)
df

In [None]:
# Loading a large file
url = "https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/examples/ex6.csv"

df = pd.read_csv(url)
df

In [None]:
# Reading only the first few rows with nrows (pass chunksize to iterate over a file in pieces)

pd.read_csv(url, nrows=5)
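
To read a whole file in pieces rather than only its first rows, `read_csv` accepts a `chunksize` argument and then returns an iterator of DataFrames. The sketch below assumes a tiny made-up CSV held in memory so that it runs without a network connection; with a real file you would pass the path or URL instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV file: 10 rows, one column
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

# chunksize makes read_csv return an iterator of DataFrames
total = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["value"].sum()   # process one piece at a time
    n_chunks += 1

print(n_chunks, total)
```

Only one chunk is held in memory at a time, which is the reason to prefer this pattern for files too large to load at once.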

## Dates & times

As may be familiar by now, there are core Python capabilities, NumPy extensions, and Pandas conveniences built on top.

* [03.11 Working with Time Series.ipynb](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.11-Working-with-Time-Series.ipynb) -- a whirlwind tour with an interesting example
* [pandas.Series.dt](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.html) API reference docs -- pandas.pydata.org

In [None]:
# the built-in datetime module allows you to create a date object
from datetime import datetime

dt = datetime(year=2015, month=7, day=4)

print(type(dt))
print(dt)

In [None]:
# the third-party dateutil module parses text into datetime.datetime objects
from dateutil import parser
date = parser.parse("4th of July, 2015")
print(type(date))
print(date)

In [None]:
# once you have a datetime object, you can do things like print the day of the week
date.strftime('%A')

In [None]:
# pandas Series.dt object
seconds_series = pd.Series(pd.date_range("2000-01-01", periods=3, freq="s"))

print(type(seconds_series))
print(seconds_series.dt.second)
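
The three layers mentioned at the top of this section can be put side by side. This is a small sketch using the same July 4, 2015 date as the cells above; the variable names are chosen here for illustration.

```python
from datetime import datetime
import numpy as np
import pandas as pd

# Core Python: a single datetime object
py_date = datetime(2015, 7, 4)

# NumPy: a typed datetime64 value that supports vectorized arithmetic
np_dates = np.datetime64("2015-07-04") + np.arange(3)

# pandas: a Series of timestamps with the .dt accessor
pd_dates = pd.Series(pd.to_datetime(["2015-07-04", "2015-07-05", "2015-07-06"]))

print(py_date.strftime("%A"))
print(np_dates)
print(pd_dates.dt.day_name().tolist())
```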

# HTML

`read_html` reads the HTML tables on a page into a list of DataFrame objects. This is a simple form of "web scraping".

A demo with dates...

In [None]:
# Read tables in an HTML file

url = "https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/examples/fdic_failed_bank_list.html"

tables = pd.read_html(url)

print(type(tables))
print(len(tables))
print(type(tables[0]))

failures = tables[0]
failures.head()

In [None]:
# It's worth inspecting things in detail, at least once.
print('1', type(failures['Closing Date'])) # a Series object pulled from the DF
print('2', type(failures['Closing Date'][0])) # a string
print('3', failures.loc[0, 'Closing Date']) # the string itself, which looks like a date

close_timestamps = pd.to_datetime(failures['Closing Date'])

print('4', type(close_timestamps)) # a Series object
print('5', type(close_timestamps[0])) # a pandas Timestamp object
print('6', close_timestamps[0]) # a pandas Timestamp object (printed)

print('value counts:', close_timestamps.dt.year.value_counts())

close_timestamps.dt.year

close_timestamps[0]
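
Scraped date columns often contain entries that do not parse. As a hedged sketch, the example below uses a few made-up strings (not taken from the FDIC table) to show `pd.to_datetime` with an explicit `format` and `errors="coerce"`, which turns unparseable entries into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical closing dates, plus one entry that is not a date
raw = pd.Series(["September 23, 2016", "August 19, 2016",
                 "January 13, 2017", "unknown"])

# errors="coerce" converts anything that does not match the format to NaT
parsed = pd.to_datetime(raw, format="%B %d, %Y", errors="coerce")
n_bad = parsed.isna().sum()

# Count events per year, ignoring the entries that failed to parse
per_year = parsed.dropna().dt.year.value_counts().sort_index()

print(n_bad)
print(per_year)
```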

# JSON

A web standard for data (as distinct from web scraping)


In [None]:
import json

obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38,
               "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

print(type(obj))

result = json.loads(obj)
print(type(result))

result
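
Once JSON text is parsed, the result is ordinary Python data (dicts, lists, strings, numbers, `None`). The sketch below uses a shortened version of the object above to show the round trip with `json.dumps` and how a list of dicts can be passed to `pd.DataFrame`.

```python
import json
import pandas as pd

obj = """
{"name": "Wes",
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30},
              {"name": "Katie", "age": 38}]
}
"""

result = json.loads(obj)        # JSON text -> Python dict
as_text = json.dumps(result)    # Python dict -> JSON text

# A list of dicts maps naturally onto rows of a DataFrame
siblings = pd.DataFrame(result["siblings"], columns=["name", "age"])

print(type(as_text))
print(siblings)
```

Note that JSON `null` becomes Python `None` after parsing.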

# Web API

* [Python requests](https://docs.python-requests.org/en/master/) can send requests to web APIs and read their responses
  * [Requests quickstart](https://docs.python-requests.org/en/latest/user/quickstart/) -- python-requests.org
* Most modern web APIs respond with JSON
* For example, GitHub has a web API for publicly accessible repositories
* Use the online documentation to get more information about various objects and methods...

In [None]:
# Import some data from the GitHub web API
import requests

url = 'https://api.github.com/repos/pandas-dev/pandas/issues'

resp = requests.get(url)
print('1:', type(resp))
print('2:', resp)

# Parse the response as JSON
data = resp.json()

# inspect the data object
print('3:', type(data)) # list
print('4:', type(data[0])) # first element in the list
print('5:', data[0]['title']) # one attribute in that element
print('6:', data[0]) # that element in its entirety

# Create a DataFrame from the list
issues = pd.DataFrame(data, columns=['number', 'title',
                                    'labels', 'state'])
print('7: the dataframe:')
issues
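
API responses are often nested, with dicts inside each record. As a sketch that runs offline, the example below uses made-up records shaped like an API response (they are not real GitHub data) and flattens them with `pd.json_normalize`, which turns nested keys into dotted column names.

```python
import pandas as pd

# Hypothetical records shaped like a JSON API response
data = [
    {"number": 1, "title": "First issue",
     "user": {"login": "alice"}, "state": "open"},
    {"number": 2, "title": "Second issue",
     "user": {"login": "bob"}, "state": "closed"},
]

# Nested dicts become columns such as "user.login"
flat = pd.json_normalize(data)

# Ordinary DataFrame filtering then applies
open_issues = flat[flat["state"] == "open"]

print(flat.columns.tolist())
print(open_issues)
```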

## USGS Earthquake data feed

* [Earthquake data from the USGS Hazards Program](https://earthquake.usgs.gov/earthquakes/feed/v1.0/geojson.php) -- usgs.gov

EXERCISE: How many earthquakes in the last day?

# Missing data

There are several ways to represent missing data.

* None -- native Python singleton object often used for missing data (slow)
* [np.nan](https://numpy.org/doc/stable/reference/constants.html?highlight=nan#numpy.nan) -- IEEE floating point representation of "Not A Number" (fast)
* pd.NA -- pandas._libs.missing.NAType (special object)

In [None]:
# None is a "NoneType", np.nan is a float, pd.NA is a special object
print(type(3))
print(type(3.14159))
print(type(np.nan))
print(type(None))

arr = np.array([1, 42])
print('arr.dtype (before):', arr.dtype)
#arr = arr + None  # this throws a TypeError
arr_na = arr + pd.NA # this works, but the result is an object array of pd.NA
print('arr.dtype (with pd.NA):', arr_na.dtype)
arr = arr + np.nan # this works, but converts the integer dtype to float64
print('arr.dtype (after):', arr.dtype)

In [None]:
# Be careful when checking a numpy array for missing data
# Try each of these in succession
#arr = np.array(['aardvark', np.nan]) # np.isnan raises a TypeError because this is a string array
#arr = np.array([1, 2, None]) # np.isnan also raises a TypeError because this is an object array
arr = np.array([1,2, np.nan]) # This works because the array contains only numeric (float) values
np.isnan(arr)

In [None]:
np.array(['hello', 'world', np.nan]) # dtype: unicode string; np.nan becomes the string 'nan'

In [None]:
np.array(['hello', 'world', None]) # dtype: object
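
Because `np.nan` is a float, it propagates through arithmetic instead of raising an error. The following sketch, with values chosen only for illustration, shows that propagation, the NaN-aware aggregation `np.nansum`, and the `object` dtype that results when `None` is used instead.

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])    # float64 array
total = vals.sum()                    # NaN propagates through the sum
safe_total = np.nansum(vals)          # NaN-aware version skips missing values

# NaN never compares equal to anything, including itself
nan_equals_itself = (np.nan == np.nan)

# With None, NumPy falls back to a generic (and slower) object array
obj_vals = np.array([1, None, 3, 4])

print(vals.dtype, total, safe_total)
print(nan_equals_itself)
print(obj_vals.dtype)
```

This is why `np.isnan` or `pd.isnull` is needed to find missing values: an equality test against `np.nan` is always False.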

# Missing data in Pandas

* Pandas has its own pd.NA (relatively new)
* In Pandas, NaN and None are nearly interchangeable, which can lead to surprising behavior
* [Working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) -- pandas.pydata.org

In [None]:
# Pandas is more forgiving, but some results might be surprising
arr = np.array(['aardvark', pd.NA, np.nan, None])
print(pd.isnull(arr))

arr

In [None]:
# Notice that this Series is dtype: Int64
arr = pd.Series([2, pd.NA], dtype="Int64")
arr

In [None]:
# And NA has its own type
type(pd.NA)

In [None]:
# Pandas converts None to np.nan in this Series with dtype: float64
pd.Series([None, 42])

In [None]:
# Pandas converts the Series to dtype: object if it has a pd.NA
pd.Series([42, pd.NA])

In [None]:
# But you can specify dtype="Int64"
pd.Series([42, pd.NA], dtype="Int64")
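
The difference between the float-based and nullable-integer representations can be seen by converting one into the other. This is a small sketch with made-up values; it assumes a pandas version that supports the nullable `Int64` dtype (1.0 or later).

```python
import pandas as pd

# None is converted to NaN, which forces the integers to float64
s_float = pd.Series([1, None, 3])

# Converting to the nullable Int64 dtype keeps integers and uses pd.NA
s_int = s_float.astype("Int64")

print(s_float.dtype, s_int.dtype)
print(s_int.isna().tolist())
print(s_int.sum())   # aggregations skip missing values by default
```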

# Handling missing values

Methods for dealing with missing data in Pandas

* `isnull()`: Generate a boolean mask indicating missing values
* `notnull()`: Opposite of isnull()
* `dropna()`: Return a filtered version of the data
* `fillna()`: Return a copy of the data with missing values filled or imputed

In [None]:
# These methods work with a Pandas Series
series = pd.Series([1, np.nan, 'hello', None])

# The return value is a copy (see reference docs)
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dropna.html
series.dropna()

In [None]:
# They also work with a Pandas DataFrame
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [pd.NA, None, np.nan],
                   [np.nan, 4,      6]])
df

In [None]:
# By default -- the entire row is dropped if it contains an NA
df.dropna()

In [None]:
# You can override the default behavior, and drop columns instead
df.dropna(axis=1)

In [None]:
# You can also specify that all elements must be NA before dropping a row/column
df.dropna(how='all')

In [None]:
# You can fill null values in any of several ways. With a single value...
df.fillna('42')

In [None]:
# Back-fill: use the next valid value (default: along axis=0); ffill() uses the previous value
df.bfill()

In [None]:
# Back-fill along axis=1: use the next valid value in the same row
df.bfill(axis=1)
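
Two other common patterns are dropping rows that have too few valid values and filling each column with its own mean. The sketch below uses a made-up DataFrame with named columns (not the one defined above) so that it is self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values scattered across columns
df_demo = pd.DataFrame({"a": [1.0, 2.0, np.nan, np.nan],
                        "b": [np.nan, 3.0, np.nan, 4.0],
                        "c": [2.0, 5.0, np.nan, 6.0]})

# Keep only rows with at least 2 non-missing values
kept = df_demo.dropna(thresh=2)

# Fill each column's missing values with that column's mean
filled = df_demo.fillna(df_demo.mean())

print(kept)
print(filled)
```

Both calls return new objects and leave `df_demo` unchanged.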

# Missing penguins

In [None]:
import seaborn as sns

# load the "penguins" dataset from seaborn
penguins = sns.load_dataset("penguins")

# inspect the dataset (note: there are some NaNs)
penguins

In [None]:
penguins.isnull()
penguins.isnull().sum()

## Q: Where are the missing penguins?

In [None]:
df = penguins[penguins['sex'].isnull()]
df

In [None]:
sns.pairplot(df, hue="species");

### Q: Why does the plot above have 5 columns & rows, whereas the next one has only 4?

In [None]:
sns.pairplot(penguins, hue="species");

In [None]:
# Inspect all the penguins
penguins

In [None]:
# Inspect the missing penguins
df

In [None]:
# A: pairplot uses every numeric column unless you specify which variables to plot
# Get this from API ref docs: https://seaborn.pydata.org/generated/seaborn.pairplot.html
# The 5th variable appears because "sex" is all NaN (floats) for the missing penguins, not a float/string mix
# Fix things by specifying "vars"
plot_vars = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
sns.pairplot(df, hue="species", vars=plot_vars)
# all penguins
print(type(penguins.loc[3, 'sex']))
print(penguins.loc[3, 'sex'])
print(penguins['sex'].dtype)

# missing penguins
print(type(df.loc[3, 'sex']))
print(df.loc[3, 'sex'])
print(df['sex'].dtype)
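
The same effect can be reproduced without seaborn or a download. This is a toy sketch with made-up rows that borrows the penguin column names: selecting the rows where `sex` is missing leaves a `sex` column that contains nothing but NaN, even though the full column mixes strings and NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the penguins table
toy = pd.DataFrame({"species": ["Adelie", "Adelie", "Gentoo"],
                    "body_mass_g": [3750.0, np.nan, 5000.0],
                    "sex": ["Male", np.nan, "Female"]})

# Boolean mask selects only the rows with a missing "sex" value
missing = toy[toy["sex"].isnull()]

# In the full table the column is a mix; in the subset it is all NaN
all_missing_full = toy["sex"].isna().all()
all_missing_subset = missing["sex"].isna().all()

print(all_missing_full, all_missing_subset)
print(missing)
```

Checking `isna().all()` on each column of a filtered DataFrame is a quick way to spot columns like this before plotting.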