# Data in Python

- Data files in Python
    - Semi-structured files
    - `pandas`
    - Web Scraping & APIs
- Working With Data

## Setup

In [None]:
# Import standard libraries
%matplotlib inline
import pandas as pd
import numpy as np

## Data 'Friendliness'

Data 'friendliness' is the degree to which a data file type easily lends itself to useful analysis.

## 'Friendly' File Types:

- csv
- tsv
- json
- txt
- xml

## 'Unfriendly' File Types:
- pdf
- docx
- html
- Anything made to look nice for humans

### CSV Files

- 'Comma Separated Value' files store data, separated by commas.
- Think of them like lists.

In [None]:
# Note: through this notebook, I will be using '!' to run the shell command 'cat'
#  to print out the content of example data files

!cat data/dat.csv

In [None]:
# Python has a module devoted to working with CSV files
import csv

In [None]:
# We can read through our file with the csv module
with open('data/dat.csv') as csvfile:
    csv_reader = csv.reader(csvfile, delimiter=',')
    for row in csv_reader:
        print(', '.join(row))

In [None]:
# Pandas also has functions to directly load csv data
pd.read_csv?

In [None]:
# Let's read in our csv file
pd.read_csv('data/dat.csv', header=None) 
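To see the difference between the two approaches without needing a file on disk, here is a minimal sketch that parses the same CSV text with both the `csv` module and `pandas`. The CSV content and the names `csv_text`, `rows` and `df_example` are made up for illustration.

```python
import csv
import io

import pandas as pd

# Hypothetical CSV content, held in memory instead of a file on disk
csv_text = "1,2,3,4\n5,6,,8\n9,10,11,12\n"

# csv.reader yields each row as a list of strings (nothing is converted)
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows[1])

# pandas parses the same text into a DataFrame and infers numeric types;
# the empty field is read in as a missing value (NaN)
df_example = pd.read_csv(io.StringIO(csv_text), header=None)
print(df_example.shape)
print(df_example.isna().sum().sum())
```

The `csv` module leaves all type conversion to you, while `pandas` infers types and flags the empty field as missing.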

## iclicker Question #1

What does `pd` in `pd.read_csv()` specify?

- A) it's the name of the function
- B) that the `read_csv` method is from the pd package
- C) that the `read_csv` method is from the pandas package (we're using the shortcut `pd`)
- D) to read a csv file into python
- E) I'm super lost

### JSON

- JavaScript Object Notation files can store hierarchical key/value pairings.
- Think of them like dictionaries.

In [None]:
!cat data/dat.json

In [None]:
# Think of json's as similar to dictionaries
d = {'firstName': 'John', 'age': '53'}
print(type(d),'\n',d)

In [None]:
# Python also has a module for dealing with json
import json

In [None]:
# Load a json file
with open('data/dat.json') as dat_file:    
    dat = json.load(dat_file)

In [None]:
# Check what data type this gets loaded as
print(type(dat))

In [None]:
# Pandas also has support for reading in json files
pd.read_json?

In [None]:
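# You can read in json formatted strings with pandas
# (wrapped in StringIO, since newer pandas versions deprecate passing raw JSON text directly)
from io import StringIO
pd.read_json(StringIO('{ "first": "Alan", "place": "Manchester"}'), typ = 'series')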

In [None]:
# Read in our json file with pandas
pd.read_json('data/dat.json', typ = 'series')
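The 'hierarchical' part of JSON is easiest to see with nested data. The sketch below uses a made-up JSON string (not the contents of `data/dat.json`) and the standard `json` module to show how nested objects and arrays map onto Python dictionaries and lists.

```python
import json

# Hypothetical JSON text with nested (hierarchical) structure
json_text = """
{
  "firstName": "John",
  "age": 53,
  "address": {"city": "San Diego", "state": "CA"},
  "phones": ["555-0100", "555-0101"]
}
"""

record = json.loads(json_text)

# JSON objects become dicts, arrays become lists, numbers stay numbers
print(type(record))
print(record["address"]["city"])
print(record["phones"][0])

# Round trip: serialize the dictionary to a JSON string and parse it back
round_trip = json.loads(json.dumps(record))
```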

### XML

- eXtensible Markup Language files store 'tagged' data. 
- Think of them like HTML.

In [None]:
!cat data/dat.xml

In [None]:
# We can read in the XML file with standard python I/O
with open('data/dat.xml') as dat_file:
    dat = dat_file.read()

In [None]:
# Check out the data
dat

In [None]:
# Beautiful Soup has functions to 'clean up' XML into human-friendlier formats
# (the 'xml' parser requires the lxml package to be installed)
from bs4 import BeautifulSoup
nice_dat = BeautifulSoup(dat, 'xml')

In [None]:
# Check out the parsed data
print(nice_dat)
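As an alternative to Beautiful Soup, the standard library's `xml.etree.ElementTree` can pull tagged values out of XML. This is a sketch on a made-up XML snippet, not the contents of `data/dat.xml`. The tag names `people`, `person`, `name` and `city` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML text; tags and attributes are invented for this example
xml_text = """
<people>
  <person id="1"><name>Alan</name><city>Manchester</city></person>
  <person id="2"><name>Ada</name><city>London</city></person>
</people>
""".strip()

root = ET.fromstring(xml_text)

# Turn each <person> element into a dictionary of its attribute and child text
people = [
    {
        "id": person.get("id"),
        "name": person.findtext("name"),
        "city": person.findtext("city"),
    }
    for person in root.findall("person")
]
print(people)
```

Once the data is a list of dictionaries, it can be passed on to other tools (for example `pd.DataFrame(people)`).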

<center>
<img src="img/pandas.png" alt="pandas" width="600px">
</center>

Pandas is a Python library for managing heterogeneous data.

At its core, Pandas is built around the **DataFrame** object, which provides:
- a data structure for labeled rows and columns of data
- associated methods and utilities for working with data
- columns that are each a `pandas` **Series**
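To make this concrete, here is a small sketch that builds a DataFrame by hand. The column names and values are invented for illustration; the point is that each column has its own type and is itself a Series.

```python
import pandas as pd

# Hypothetical data: one text column, one integer column, one float column
people_df = pd.DataFrame(
    {
        "first_name": ["Ada", "Alan", "Grace"],
        "age": [36, 41, 85],
        "score": [9.5, 8.0, 9.9],
    }
)

# Each column can hold a different data type
print(people_df.dtypes)

# Selecting a single column returns a pandas Series
age_column = people_df["age"]
print(type(age_column))

# Rows get a default integer index when none is specified
print(list(people_df.index))
```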

## Loading Data

In [None]:
# Load a csv file of data
df = pd.read_csv('data/my_data.csv')

In [None]:
# Check out a few rows of the dataframe
df.head()

Pandas DataFrame:
- Index for each row
- Column name for each column
- Stores heterogeneous types

## Indexing & Slicing

In [None]:
# Indexing: select a column using its name
df['last_name']

In [None]:
type(df['last_name'])

In [None]:
# Indexing: select a row & column with 'loc'
df.loc[10, 'score']
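Note that `loc` selects by *label*, not by position. The sketch below uses a small made-up DataFrame (`scores_df`, with an index of 10, 20, 30 chosen for illustration) to contrast `loc` with the position-based `iloc`.

```python
import pandas as pd

# Hypothetical data with row labels that differ from row positions
scores_df = pd.DataFrame(
    {"last_name": ["Lovelace", "Turing", "Hopper"], "score": [9.5, 8.0, 9.9]},
    index=[10, 20, 30],
)

# loc: the row *labelled* 10, column 'score'
by_label = scores_df.loc[10, "score"]

# iloc: the *first* row (position 0), second column (position 1)
by_position = scores_df.iloc[0, 1]

# loc also accepts lists of labels to select several rows / columns at once
subset = scores_df.loc[[10, 30], ["last_name"]]
print(by_label, by_position)
print(subset)
```

Here both lookups reach the same cell, but `scores_df.iloc[10, 1]` would fail because there is no eleventh row.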

## iclicker Question #2

What would be the output of `df['age'] > 10`?

- A) subset of `df` including only rows of individuals older than 10
- B) a Boolean Series with `True` for rows where age is greater than 10 and `False` otherwise
- C) `id`s of rows where observations are greater than 10 
- D) an error
- E) I'm super lost

In [None]:
## YOUR CODE HERE
df['age'] > 10

# To get the matching rows as a DataFrame, index df with the Boolean result
df[df['age'] > 10]
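The same pattern, sketched on a small made-up DataFrame so the intermediate Boolean Series is easy to inspect. The names `ages_df`, `mask`, `older` and `in_range` are illustrative.

```python
import pandas as pd

# Hypothetical data
ages_df = pd.DataFrame({"name": ["Ana", "Ben", "Cy", "Di"], "age": [8, 12, 10, 47]})

# The comparison is applied to every row and returns a Boolean Series
mask = ages_df["age"] > 10
print(mask.tolist())

# Indexing with the Boolean Series keeps only the rows marked True
older = ages_df[mask]

# Conditions can be combined with & (and) or | (or); each needs its own parentheses
in_range = ages_df[(ages_df["age"] > 10) & (ages_df["age"] < 40)]
print(older)
print(in_range)
```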


## Checking out the DataFrame

In [None]:
# Check how large our dataframe is
df.shape

In [None]:
# Check what columns we have in our DataFrame
df.columns

In [None]:
# Check the datatypes of our variables
df.dtypes

In [None]:
# Convert the 'id' column to strings (non-numerical) and use it as the index (row labels)
df['id'] = df['id'].astype('str')
df = df.set_index(df['id'])
df.head() 

## Exploring the data

- quantitative (numbers)
- qualitative (categorical)
- basic descriptive statistics

In [None]:
# Checking categorical data
df['first_name'].value_counts()[0:10]

In [None]:
# Check a particular descriptive statistic
df['value'].mean()

In [None]:
# Describe a particular column
df['score'].describe()

In [None]:
# Get descriptive statistics of all numerical columns
df.describe()
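A small worked example of the same calls, on made-up data where the answers can be checked by hand. The DataFrame `survey_df` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: one categorical column, one numerical column
survey_df = pd.DataFrame(
    {"first_name": ["Ana", "Ben", "Ana", "Cy"], "score": [2.0, 4.0, 6.0, 8.0]}
)

# Categorical data: count how often each value occurs
name_counts = survey_df["first_name"].value_counts()
print(name_counts)

# Numerical data: count, mean, std, min, quartiles and max in one call
score_summary = survey_df["score"].describe()
print(score_summary)
```

With scores of 2, 4, 6 and 8, the mean and the median (the `50%` row) are both 5.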

## iclicker Question #3

What's the average (mean) age of the individuals in this dataset?

- A) 14
- B) 46
- C) 28730
- D) NA
- E) I'm super lost/unsure

In [None]:
## YOUR CODE HERE
df['age'].mean()

## Application Programming Interfaces (APIs)

- APIs are basically a way for software to talk to software 
    - They are an interface into an application / website / database designed for computers / software.

Notes on APIs:
- Follow API guidelines! 
- These guidelines typically specify the number / rate / size of requests
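One simple way to respect a rate guideline is to enforce a minimum wait between requests. The sketch below is one possible approach, not an official recipe. The helper `call_with_min_interval` and the stand-in `fake_request` are hypothetical names, and no real API is contacted.

```python
import time


def call_with_min_interval(func, items, min_interval_seconds):
    """Call func(item) for each item, waiting at least
    min_interval_seconds between consecutive calls."""
    results = []
    last_call = None
    for item in items:
        if last_call is not None:
            elapsed = time.monotonic() - last_call
            if elapsed < min_interval_seconds:
                # Pause for the remainder of the interval
                time.sleep(min_interval_seconds - elapsed)
        last_call = time.monotonic()
        results.append(func(item))
    return results


def fake_request(username):
    """Stand-in for a real API request (hypothetical)."""
    return {"login": username}


start = time.monotonic()
responses = call_with_min_interval(
    fake_request, ["a", "b", "c"], min_interval_seconds=0.05
)
elapsed_total = time.monotonic() - start
print(responses, round(elapsed_total, 2))
```

In real use, the interval should come from the API's documented limits.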

## GitHub API

You can access the GitHub API at the following URL. Just add specifiers for what you are looking for.

https://api.github.com/

For example, the following URL returns data on the user 'ShanEllis':

https://api.github.com/users/shanellis

<center>
<img src="img/github.png" alt="sql" height="100" width="100">
</center>

## Requesting Web Pages from Python

In [None]:
# The requests module allows you to send URL requests from python
import requests  
from bs4 import BeautifulSoup

In [None]:
# Request data from the GitHub API on a particular user
page = requests.get('https://api.github.com/users/shanellis')  

In [None]:
# The content we get back is a messily organized json file
page.content

## iclicker Question #6

What type/format of output is this?

- A) CSV
- B) XML
- C) JSON
- D) API
- E) I'm super lost

In [None]:
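# We can read in the json data with pandas
# (wrapped in StringIO, since newer pandas versions deprecate passing raw JSON text directly)
from io import StringIO
git_data = pd.read_json(StringIO(page.text), typ='series')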

In [None]:
# Check out the pandas series object full of data
git_data  
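The response body can also be turned into a plain Python dictionary with the `json` module. The sketch below uses a made-up byte string (shaped like an API reply, but not copied from one) so that it runs without a network connection. With a real `requests` response, `page.json()` does the equivalent in one step.

```python
import json

# Hypothetical response body: raw bytes, as returned by `page.content`
raw_content = b'{"login": "example-user", "public_repos": 12, "name": null}'

# Decode the bytes to text, then parse the JSON text into a dictionary
user_info = json.loads(raw_content.decode("utf-8"))

print(user_info["login"])
print(user_info["public_repos"])

# JSON null becomes Python None
print(user_info["name"])
```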

### Authorized Access - OAuth

Open Authorization is a protocol to authorize access (of a user / application) to an API.

OAuth provides a secure way to 'log in' without sharing account names and passwords with the application.

It is effectively a set of keys and tokens that you can use to access APIs.
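In practice, an access token is often sent along with each request in an HTTP header. The sketch below shows the common 'Bearer' header form; the helper name `build_auth_headers` and the placeholder token are assumptions, and each API documents the exact scheme it expects.

```python
def build_auth_headers(access_token):
    """Return HTTP headers that carry an OAuth 2.0 bearer token."""
    if not access_token:
        raise ValueError("an access token is required")
    return {
        "Authorization": f"Bearer {access_token}",
        "Accept": "application/json",
    }


# Placeholder value: never hard-code a real token in a notebook
headers = build_auth_headers("PLACEHOLDER_TOKEN")
print(headers)
```

With the `requests` module, these headers would be passed as `requests.get(url, headers=headers)`.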

## Web Scraping vs. APIs

Web scraping and APIs are different approaches:

- APIs are an interface to interact with an application, designed for programmatic use
    - They allow systematic, controlled access to (for example) an application's database
    - They typically return structured (friendly) data 

- Web scraping (typically) involves navigating through the internet, programmatically following an architecture built for humans
    - This can be hard to systematize, being dependent on the idiosyncrasies of a web page at the time you request it
    - This typically returns relatively unstructured data
    - This entails much more wrangling of the data
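To illustrate the difference in effort, the sketch below extracts the same value twice: once from a made-up JSON reply (the API route) and once from a made-up HTML page (the scraping route), using only the standard library. The page layout, the `id="repos"` attribute and the class name `RepoCountParser` are all invented for this example.

```python
import json
from html.parser import HTMLParser

# Hypothetical API reply and hypothetical web page holding the same information
json_text = '{"name": "Example User", "public_repos": 12}'
html_text = """
<html><body>
  <div class="profile"><h1>Example User</h1>
  <p>Repositories: <span id="repos">12</span></p></div>
</body></html>
"""

# API route: structured data, one lookup
api_repos = json.loads(json_text)["public_repos"]


class RepoCountParser(HTMLParser):
    """Collect the text inside the element whose id is 'repos'."""

    def __init__(self):
        super().__init__()
        self._inside_target = False
        self.repo_text = None

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == "repos":
            self._inside_target = True

    def handle_data(self, data):
        if self._inside_target:
            self.repo_text = data.strip()

    def handle_endtag(self, tag):
        self._inside_target = False


# Scraping route: walk the page and rely on how it happens to be laid out
parser = RepoCountParser()
parser.feed(html_text)
scraped_repos = int(parser.repo_text)

print(api_repos, scraped_repos)
```

The scraping code breaks as soon as the page changes its layout or the `id` it relies on, which is the fragility described above.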

# Notes on Working with Data

### Data Science is Ad-Hoc

- It is part of the job description to put things together that were not designed to go together.
- We do not have universal solutions, but haphazard, idiosyncratic systems, for data collection, storage and analysis.
- Data is everywhere. But relatively little of it was collected *as data*.

### Data Collection, Curation, and Storage are Difficult

- It can be difficult to choose broadly useful standards
- Take time to think about your data, and how you will load, store, organize and save it

### Data is Inherently Noisy

- We live in a messy, noisy world, with messy, noisy people, using messy, noisy instruments.
- There is no perfect data.
    - There is better or worse data, given the context.

### Different Objectives

- Humans and computers are different.
- We interact with '*data*' in different ways.
- This underlies many aspects of data wrangling
    - The 'friendliness' of data types / files
    - The difference between web scraping and APIs
    - A disconnect between data in the real world, and data we want to use

## So... What to do?

- Think about how your data are stored and structured
- Look at your data before you analyze it
    - are there missing values? 
    - outlier values? 
- Are your data trustworthy? 
    - source?
    - how was it generated?
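A first look for missing values and outliers might be sketched as follows. The data are made up, and the 1.5 × IQR rule used here is just one common convention for flagging outliers, not the only reasonable choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column and one implausible age
readings_df = pd.DataFrame(
    {
        "age": [23, 31, np.nan, 27, 29, 240],
        "score": [7.5, 8.0, 6.5, np.nan, 9.0, 7.0],
    }
)

# Missing values: count NaN entries in each column
missing_per_column = readings_df.isna().sum()
print(missing_per_column)

# Outliers: flag values more than 1.5 * IQR outside the middle 50% of the data
q1 = readings_df["age"].quantile(0.25)
q3 = readings_df["age"].quantile(0.75)
iqr = q3 - q1
is_outlier = (readings_df["age"] < q1 - 1.5 * iqr) | (
    readings_df["age"] > q3 + 1.5 * iqr
)
outliers = readings_df[is_outlier]
print(outliers)
```

Whether a flagged value is an error or a genuine observation still requires knowing where the data came from.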

## Specific Recommendations

- Prioritize using well structured, common, open file types
    - Take advantage of existing tools to deal with these files (numpy, pandas, etc.)

- Look into, and then follow, common conventions
    - Minimize custom objects, workflows and data files 
- Look for APIs. Ask if they are available.
    - Acknowledge that web scraping and/or wrangling unstructured data are complex / long tasks

- Think about data flow from the beginning. Organize your data pipeline, and consider the 'wrangling' aspects throughout
    - Set yourself up with a well-organized, labelled approach to your data
    - Think about when and how you might want/need to save out intermediate results.
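As a minimal sketch of saving out an intermediate result, the example below writes a (made-up) cleaned DataFrame to a CSV file and reads it back. A temporary directory is used so the example cleans up after itself. In a real project you would choose a stable, clearly named location instead; the file name `cleaned_scores.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical intermediate result from an earlier cleaning step
cleaned_df = pd.DataFrame({"id": ["a1", "b2"], "score": [9.5, 8.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "cleaned_scores.csv"

    # index=False avoids writing the row index as an extra column
    cleaned_df.to_csv(out_path, index=False)

    # Later steps can start from the saved file rather than redoing the cleaning
    reloaded_df = pd.read_csv(out_path)

print(reloaded_df)
```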