# Exploratory Data Analysis in Python

<hr>

## What we'll cover

* [Accessing data sources with Python](#Accessing-data-sources-with-Python)
  * [Web Scraping](#Web-Scraping)
  * [APIs](#APIs)
  * [Flat files](#Flat-files)
  * [Databases](#Databases)
* [Additional Materials](#Additional-Materials)

<hr>

## Accessing data sources with Python

<hr>

Once you have a good grasp of Python's basic functionality, you can interact with a number of data sources. This section focuses on the basics of extracting, transforming, and loading data from different formats into dataframes for analysis. Data manipulation inside the dataframes is saved for Part 5.

<hr>

## Basics

<hr>

There are several key terms and concepts to be aware of when collecting data for analysis and visualization:

* Primary Sources - collected directly from the original source
* Secondary Sources - collected by an intermediary
* Explicitly Spatial - for data where location patterns are directly analyzed
* Implicitly Spatial - for data that represents location, but is not directly analyzed spatially
* Individual Data - data that represents a single unit of something
* Aggregate Data - data that represents a sum of single units of something
* Discrete Data - a data type that represents a count of something, with a finite set of possible values
* Continuous Data - a data type that represents an interval or measure of something, with potentially infinite possible values
* Qualitative Data - attributes, labels, non-numerical entries
* Quantitative Data - numerical measurements, counts
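
To make a few of these terms concrete, here is a minimal sketch using only the standard library. The household records, county names, and field names are all hypothetical and exist only to illustrate the vocabulary above.

```python
from collections import Counter
from statistics import mean

# Hypothetical individual data: one record per surveyed household
households = [
    {"county": "County A", "members": 3, "income": 52300.50, "tenure": "owner"},
    {"county": "County A", "members": 1, "income": 31750.00, "tenure": "renter"},
    {"county": "County B", "members": 4, "income": 67120.25, "tenure": "owner"},
]

# Aggregate data: sum a discrete count (household members) per county
members_by_county = Counter()
for household in households:
    members_by_county[household["county"]] += household["members"]

# Continuous, quantitative data: income is a measurement, so a mean is meaningful
mean_income = mean(household["income"] for household in households)

# Qualitative data: tenure is a label, so we tally it rather than average it
tenure_counts = Counter(household["tenure"] for household in households)
```

The same record can carry several kinds of data at once, which is why it helps to classify each column before choosing a summary or a chart.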

<hr>

## Web Scraping

<hr>

### Urllib and IO

The first scraper we'll build will use core Python libraries to:

* Go to an HTTP website
* Gather the source code
* Print the output

In [None]:
# Here we'll import the urllib, io, and pprint modules to obtain our data

from urllib.request import Request, urlopen
from io import TextIOWrapper
from pprint import pprint

# Declare the URL
url = 'https://en.wikipedia.org/wiki/Doune_Castle'

# Open the URL
page = Request(url)
page_content = urlopen(page)
# page_content.read()

# Buffer our text stream from the website
page_data = TextIOWrapper(page_content)

# pprint out our data
for row in page_data:
    pprint(row)
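
A slightly more defensive variant of the same idea is sketched below. It assumes you want the whole page as one decoded string, and it sends a `User-Agent` header because some sites may reject requests without one. The `fetch_text` helper and the `USER_AGENT` value are hypothetical names introduced for this example.

```python
from urllib.request import Request, urlopen

# Hypothetical identifier; replace it with something that describes your script
USER_AGENT = 'eda-tutorial-example/0.1'


def fetch_text(url, timeout=10):
    """Return the decoded body of `url` as a single string."""
    request = Request(url, headers={'User-Agent': USER_AGENT})
    # The with-block closes the connection even if reading or decoding fails
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the response headers, falling back to UTF-8
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset)
```

Calling `fetch_text('https://en.wikipedia.org/wiki/Doune_Castle')` should return the page source as text, which you can then split into lines or hand to a parser.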

### Requests and BeautifulSoup

However, we may want something a bit more elegant. This is where `requests` and `beautifulsoup` come in to help us out.

In [None]:
# Import requests and beautifulsoup
# Import pandas, we'll use that at the end
import requests
from bs4 import BeautifulSoup
import pandas as pd

# we are going to scrape crime data from the University of Kentucky (UK) crime log: http://www.uky.edu/crimelog/
# substitute variables to fill in the query string criteria
start_month, start_day, start_year = 1, 1, 2018
end_month, end_day, end_year = 10, 4, 2018
crime_data_raw = requests.get('http://www.uky.edu/crimelog/log?field_log_category_value=All' +
                              '&field_log_report_value%5Bmin%5D%5Bmonth%5D=' + str(start_month) +
                              '&field_log_report_value%5Bmin%5D%5Bday%5D=' + str(start_day) +
                              '&field_log_report_value%5Bmin%5D%5Byear%5D=' + str(start_year) +
                              '&field_log_report_value%5Bmax%5D%5Bmonth%5D=' + str(end_month) +
                              '&field_log_report_value%5Bmax%5D%5Bday%5D=' + str(end_day) +
                              '&field_log_report_value%5Bmax%5D%5Byear%5D=' + str(end_year)
                             )
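
The long concatenation above can also be expressed as a dictionary of parameters. The sketch below uses `urllib.parse.urlencode` from the standard library to build the same query string; the `build_crime_log_url` helper is a hypothetical name introduced here. `requests.get` also accepts a `params` dictionary, so the same dictionary could be passed there instead of building the URL by hand.

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.uky.edu/crimelog/log'


def build_crime_log_url(start, end, category='All'):
    """Build the crime log URL; `start` and `end` are (month, day, year) tuples."""
    params = {'field_log_category_value': category}
    for bound, (month, day, year) in (('min', start), ('max', end)):
        params[f'field_log_report_value[{bound}][month]'] = month
        params[f'field_log_report_value[{bound}][day]'] = day
        params[f'field_log_report_value[{bound}][year]'] = year
    # urlencode percent-encodes the square brackets as %5B and %5D
    return BASE_URL + '?' + urlencode(params)


crime_log_url = build_crime_log_url((1, 1, 2018), (10, 4, 2018))
```

Keeping the parameters in a dictionary makes it easier to change the date range without touching the encoded bracket sequences.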


In [None]:
# create a soup object 
crime_bs_proc = BeautifulSoup((crime_data_raw.text), "html5lib")

In [None]:
# create a filter for our soup object to pull out the table
crime_data_table = crime_bs_proc.find('table', {'class': 'views-table cols-8'})

In [None]:
# find the table header in the data
crime_data_header = crime_data_table.find('thead')

In [None]:
# find all the table headers
crime_data_heads = crime_data_header.find_all('th')

In [None]:
# create an empty list for the header
header = []

# iterate through the header element to get text
for col in crime_data_heads:
    cols = col.find_all('a')
    cols = [ele.text.strip() for ele in cols]
    header.append([ele for ele in cols if ele])

# flatten the list to a single list
header = [item for sublist in header for item in sublist]

In [None]:
# find the table rows in the data
crime_data_body = crime_data_table.find('tbody')

In [None]:
# find all table rows
crime_data_rows = crime_data_body.find_all('tr')

In [None]:
# create an empty list for the rows of data
data = []

# iterate through the table rows to get the text of each cell
for row in crime_data_rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])

In [None]:
# create a dataframe with our data using our header list
uk_crime_data = pd.DataFrame(data, columns=header)
uk_crime_data.head()
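
Because the row loop keeps only non-empty cells, a row with a blank cell ends up shorter than the header. A quick check before building the dataframe can catch this. The sketch below is one way to do it; `split_rows_by_width` and the sample lists are hypothetical stand-ins for the scraped `header` and `data`.

```python
def split_rows_by_width(rows, width):
    """Separate rows that match the header width from those that do not."""
    matching, mismatched = [], []
    for row in rows:
        (matching if len(row) == width else mismatched).append(row)
    return matching, mismatched


# Hypothetical sample standing in for the scraped `header` and `data` lists
sample_header = ['Report #', 'Category', 'Location']
sample_data = [
    ['1', 'Theft', 'Library'],
    ['2', 'Vandalism'],  # a blank cell was dropped, so this row is one item short
]

good_rows, short_rows = split_rows_by_width(sample_data, len(sample_header))
```

If `short_rows` is not empty, one option is to keep empty strings in the row loop so that every row has the same number of cells as the header.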

There is also the `scrapy` library in Python for more complex scraping projects.

<hr>

## APIs

<hr>

APIs often have 'wrappers' in Python that you can use to interface with the underlying data.

Here we will use the data.world API to import some data:

  * docs at https://github.com/datadotworld/data.world-py

Prior to this, you should load your API credentials from data.world into your active virtual environment (in the terminal):

`dw configure`

or

`export DW_AUTH_TOKEN=<YOUR_TOKEN>`
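
If you take the environment-variable route, a short check from Python can confirm the token is visible to the notebook's process. This is a minimal sketch; `has_dw_token` is a hypothetical helper, and it only inspects the environment, so it says nothing about credentials saved through `dw configure`.

```python
import os


def has_dw_token(environ=os.environ):
    """Return True when DW_AUTH_TOKEN is set to a non-empty value."""
    return bool(environ.get('DW_AUTH_TOKEN', '').strip())


if not has_dw_token():
    print('DW_AUTH_TOKEN is not set in this environment.')
```

A notebook started before the `export` command was run will not see the variable, so restarting the kernel from the same terminal session is a common fix.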

In [None]:
# import our API library

import datadotworld as dw

In [None]:
# load our data sets from the API using a known user data collection

afg_conflict = dw.load_dataset('ochaafghanistan/a7f147de-1345-49a0-89f9-563fd7f541b1')

In [None]:
# list the dataframes available in the data set collection

afg_conflict.dataframes

In [None]:
# load a data set into a dataframe from the data collection

afg_df = afg_conflict.dataframes.get('afghanistan_conflict_displacements_2021_csv_1')
afg_df.head(5)

<hr>

## Flat files

<hr>

There are several ways to import flat files for analysis.

The simplest method is to use `pandas`, as it supports several well-known formats.

However, for each of the following files, there are core and 3rd party libraries you can also use to load your data.

In [None]:
# import pandas
import pandas as pd

### CSV

In [None]:
# read csv with pandas

census_fl_csv = pd.read_csv('data/census_2019_fl.csv')
census_fl_csv.head(2)

In [None]:
# you can use the csv library to import/manipulate csv files

import csv

with open('data/census_2019_fl.csv') as census_fl_csv_2:
    reader = csv.DictReader(census_fl_csv_2)  # You can also use csv.reader
    for row in reader:
        print(row)
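
To see what `csv.DictReader` yields without needing the census file, here is a small in-memory sketch. The sample rows are invented, and the column names (`geoid`, `label`, `totpop`) are assumed to match the ones used in the XML example later in this section.

```python
import csv
import io

# Hypothetical sample data; io.StringIO lets csv read a string as if it were a file
sample_csv = io.StringIO(
    'geoid,label,totpop\n'
    '01001,County A,1000\n'
    '01003,County B,2500\n'
)

# Each row is a dictionary keyed by the header line
rows = list(csv.DictReader(sample_csv))

# csv returns every value as a string, so numeric columns must be converted
total_pop = sum(int(row['totpop']) for row in rows)
```

Note that the leading zero in `geoid` survives because the value stays a string; converting identifier columns to integers would drop it.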

### Excel

In [None]:
# read excel in xls format with pandas

census_fl_xls = pd.read_excel('data/census_2019_fl.xls')
census_fl_xls.head(2)

In [None]:
# read excel in xlsx format with pandas

census_fl_xlsx = pd.read_excel('data/census_2019_fl.xlsx')
census_fl_xlsx.head(2)

### JSON

In [None]:
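# assumption: the truncated call is `to_pickle`, since the Binary section reads this same extensionless path with pd.read_pickle
census_fl_csv.to_pickle('data/census_2019_fl')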

In [None]:
# read json with pandas

census_fl_json = pd.read_json('data/census_2019_fl.json')
census_fl_json.head(2)

In [None]:
# you can also use the core json library to import json data

import json

with open('data/census_2019_fl.json') as census_fl_json_2:
    reader = json.load(census_fl_json_2)
    for row in reader:
        print(row)
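
What the loop above prints depends on the top-level structure of the JSON file. The sketch below contrasts two common layouts using invented sample strings: a list of records, and a column-oriented object (the default layout written by a dataframe's `to_json`).

```python
import json

# Hypothetical sample documents describing the same two rows
records_text = '[{"geoid": "01001", "totpop": 1000}, {"geoid": "01003", "totpop": 2500}]'
columns_text = '{"geoid": {"0": "01001", "1": "01003"}, "totpop": {"0": 1000, "1": 2500}}'

records = json.loads(records_text)  # a list of dicts
columns = json.loads(columns_text)  # a dict of dicts

# Iterating a list of records yields one row at a time
rows_from_records = [row for row in records]

# Iterating a column-oriented object yields only the column names
names_from_columns = [name for name in columns]
```

If the loop prints column names instead of rows, the file is probably column-oriented, and iterating over `columns.items()` or loading it with `pandas` may be more convenient.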

### XML

In [None]:
# read xml into dataframe using core xml library

import xml.etree.ElementTree as et

root = et.parse('data/census_2019_fl.xml')  # use element tree to parse the xml data
rows = root.findall('row')  # find all row elements in xml
# iterate and select elements in row
data = [[row.find('geoid').text, row.find('label').text, row.find('totpop').text] for row in rows]
# push above data into pandas dataframe
census_fl_xml = pd.DataFrame(data, columns=['geoid', 'label', 'totpop'])
census_fl_xml.head(2)  # check your dataframe
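
The list comprehension above assumes every `row` contains all three child elements; `row.find('totpop').text` raises an `AttributeError` when a child is missing. The sketch below shows a more forgiving variant using `findtext` on an invented in-memory document with the same element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the second row has no <totpop> element
sample_xml = """
<data>
  <row><geoid>01001</geoid><label>County A</label><totpop>1000</totpop></row>
  <row><geoid>01003</geoid><label>County B</label></row>
</data>
"""

root = ET.fromstring(sample_xml)
fields = ['geoid', 'label', 'totpop']

# findtext returns None when the child element is absent instead of raising
records = [[row.findtext(field) for field in fields] for row in root.findall('row')]
```

The resulting `records` list can be passed to `pd.DataFrame(records, columns=fields)` in the same way as `data` above, with missing values appearing as `None`.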

In [None]:
# read xml into dataframe using lxml library

from lxml import objectify
# use objectify to parse xml data (passing the path lets lxml open and close the file)
xml_data = objectify.parse('data/census_2019_fl.xml')
root = xml_data.getroot()  # select root tree in xml data
# create an empty list as destination for our data
data = []
# for the row data in our root data
for elt in root.row:
    # create an empty dictionary
    el_data = {}
    # for each child element in row, store its value under its tag name
    for child in elt.iterchildren():
        el_data[child.tag] = child.pyval
    data.append(el_data)
# create a pandas dataframe for data list
census_fl_xml_2 = pd.DataFrame(data)
# check your dataframe
census_fl_xml_2.head(2)

### Binary

In [None]:
# import binary with pandas

census_fl_binary = pd.read_pickle('data/census_2019_fl')
census_fl_binary.head(2)

Note that `pandas` also supports many other file formats such as `hdf5`, `stata`, `SQL`, `html`, `sas`, and even data from your `clipboard`.

### PDF

In [None]:
# read pdf files... may god have mercy on your soul.

import pdfx
import pprint
# after pdfx import, create a PDFx object for our PDF
census_fl_pdf = pdfx.PDFx('data/census_2019_fl.pdf')
# extract metadata for PDF
census_fl_pdf_metadata = census_fl_pdf.get_metadata()
# extract references and place them in a dictionary, hyperlink extraction also possible
census_fl_pdf_refs = census_fl_pdf.get_references_as_dict()
# extract the body of text from PDF
census_fl_pdf_text = census_fl_pdf.reader.get_text()

This is a great starting point for extracting text, metadata, and references (with hyperlinks) from PDFs (very useful for social scientists). However, there are a few ways to extract tabular data from PDFs and none are very easy. The techniques through which the tabular text can be restructured for a dataframe will be covered in Part 4.

### DOCX

In [None]:
# read docx files...

import docx
# create a document object for our docx file
doc = docx.Document('data/census_2019_fl.docx')

In [None]:
# check the number of paragraphs
len(doc.paragraphs)

In [None]:
# pull the text out of the first paragraph
doc.paragraphs[0].text

In [None]:
# extract and output contents of tables
table = doc.tables[0]
# create empty list for preprocessing
data = []
# for each row in the table
# for each cell in row
# add the cell to the list 'data'
for row in table.rows:
    for cell in row.cells:
        data.append(cell.text)
# create a function to split our long list into n size chunks equal to # of headers
def sublist_gen(l, n):
    for i in range(0, len(l), n):
        yield l[i:i + n]
# use our function to create list of list
# first list == headers
sub_data = list(sublist_gen(data, 6))
# extract our headers
headers = sub_data.pop(0)
# create a dataframe from lists
docx_table_dataframe = pd.DataFrame(sub_data, columns=headers)

In [None]:
# check your dataframe
docx_table_dataframe
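
The chunking step is easier to follow on a small list. The sketch below reuses the same generator on invented cell text from a three-column table; the sample values are hypothetical.

```python
def sublist_gen(items, n):
    """Yield successive chunks of length n from items (the last may be shorter)."""
    for i in range(0, len(items), n):
        yield items[i:i + n]


# Hypothetical flattened cell text from a 3-column table, header row first
cells = ['geoid', 'label', 'totpop',
         '01001', 'County A', '1000',
         '01003', 'County B', '2500']

rows = list(sublist_gen(cells, 3))
headers, body = rows[0], rows[1:]

# Pair each body row with the headers to get one dictionary per table row
records = [dict(zip(headers, row)) for row in body]
```

In the cell above the chunk size is hard-coded as 6. As an assumption worth verifying against the python-docx documentation, `len(table.columns)` may give the same number without hard-coding it.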

<hr>

## Databases

<hr>

### SQLite

Python 3 ships with the `sqlite3` module, and SQLite can be a powerful tool for data exploration. We'll cover databases more in the next section, but SQLite is a great way to store your data as you perform your EDA.

Download the SQLite sample data and diagram from https://www.sqlitetutorial.net/sqlite-sample-database/ and save it to the data folder.


In [None]:
import sqlite3
import pandas as pd

In [None]:

# read data from sqlite3 database
connection = sqlite3.connect('data/chinook.db')
# use pandas to read a table from the database connection to create a dataframe
customers = pd.read_sql_query("SELECT * FROM customers", connection)
# close the database connection once you're done creating your pandas dataframe
connection.close()

In [None]:
# test your dataframe
customers.head(2)
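
If you do not have the sample database yet, the same workflow can be tried against an in-memory database. The sketch below is self-contained; the table layout and rows are invented and only loosely mirror a customers table.

```python
import sqlite3
from contextlib import closing

# closing() guarantees the connection is closed when the block exits
with closing(sqlite3.connect(':memory:')) as connection:
    connection.execute(
        'CREATE TABLE customers ('
        'CustomerId INTEGER PRIMARY KEY, FirstName TEXT, Country TEXT)'
    )
    # Hypothetical rows used only for this example
    connection.executemany(
        'INSERT INTO customers (FirstName, Country) VALUES (?, ?)',
        [('Ana', 'Brazil'), ('Ben', 'Canada'), ('Cleo', 'Brazil')],
    )
    # The ? placeholder lets sqlite3 bind the value instead of formatting it into the SQL
    cursor = connection.execute(
        'SELECT FirstName FROM customers WHERE Country = ? ORDER BY FirstName',
        ('Brazil',),
    )
    brazil_customers = [row[0] for row in cursor]
```

`pd.read_sql_query` also accepts a `params` argument, so the same placeholder style can be used when loading a filtered table straight into a dataframe.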

<hr>

## Additional Materials

<hr>

### For Future Versions

* [Newspaper](https://github.com/codelucas/newspaper/)


<hr>

## Resources

<hr>

**Note:** A lot of the open-source materials are provided by people who develop those materials for a living. So please consider sending them a thank you and, if you can, a few bucks to support their efforts. Thanks! :)

* [Pandas](https://pandas.pydata.org/pandas-docs/stable/)
* [urllib](https://docs.python.org/3/library/urllib.html)
* [io](https://docs.python.org/3/library/io.html)
* [pprint](https://docs.python.org/3/library/pprint.html)
* [requests](http://docs.python-requests.org/en/latest/)
* [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)
* [datadotworld](https://github.com/datadotworld/data.world-py)
* [csv](https://docs.python.org/3/library/csv.html)
* [json](https://docs.python.org/3/library/json.html)
* [xml](https://docs.python.org/3/library/xml.html)
* [lxml](https://lxml.de/)
* [pdfx](https://github.com/metachris/pdfx)
* [python-docx](https://python-docx.readthedocs.io/en/latest/)
* [sqlite](https://docs.python.org/3/library/sqlite3.html)