In [None]:
%%R
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  dev = "svg",
  fig.align = "center",
  #fig.width = 11,
  #fig.height = 5
  cache = FALSE
)

# define vars
om = par("mar")
lowtop = c(om[1],om[2],0.1,om[4])
library(tidyverse)
library(knitr)
library(reticulate)
use_python("C:\\Users\\jbpost2\\AppData\\Local\\Programs\\Python\\Python310\\python.exe")
#use_python("C:\\python\\python.exe")
options(dplyr.print_min = 5)
options(reticulate.repl.quiet = TRUE)

layout: false
class: title-slide-section-red, middle

# Pandas for Reading Raw Data 
Justin Post

---
layout: true

<div class="my-footer"><img src="img/logo.png" style="height: 60px;"/></div> 

---

# Reading Raw Data

- `pandas` library has functionality for reading **delimited data**, Excel data, SAS data, JSON data, etc.
    
- Remember we'll need to import the `pandas` library

In [None]:
import pandas as pd

---

# Data Comes in Many Formats

- 'Delimited' data: Character (such as [','](https://www4.stat.ncsu.edu/~online/datasets/scores.csv), ['>'](https://www4.stat.ncsu.edu/~online/datasets/umps2012.txt), or ' ') separated data  

- [Fixed field](https://www4.stat.ncsu.edu/~online/datasets/cigarettes.txt) data  

- [Excel](https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx) data  

- From other statistical software, Ex: [SPSS formatted](https://www4.stat.ncsu.edu/~online/datasets/bodyFat.sav) data or [SAS data sets](https://www4.stat.ncsu.edu/~online/datasets/smoke2003.sas7bdat)  

- From an Application Programming Interface (API)

- From a database  



---

# Delimited Data

- One common format for **raw** data is delimited data

    + Data that has a character or characters that separate the data values
    
    + The separating character(s) are called **delimiters**

- Using `pandas`, the `read_csv()` function can read in the data

    + Just need to tell Python where to find it!
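
As a minimal sketch of the idea, the snippet below reads a small semicolon-delimited sample from an in-memory string rather than from a file. The sample values and the names `raw_text` and `demo_data` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample: three columns separated by a semicolon delimiter
raw_text = "name;age;score\nAna;23;88.5\nBen;31;92.0\n"

# sep tells read_csv which character separates the data values
demo_data = pd.read_csv(io.StringIO(raw_text), sep=";")
print(demo_data)
```

Swapping `io.StringIO(raw_text)` for a file path or URL should read the same layout from a real file.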


---

# Locating a File

- How does Python locate the file?  

    + Can give the file's *full path name*  

        * ex: 'S:/Documents/repos/ST-554/datasets/data.csv'  
        * ex: 'S:\\Documents\\repos\\ST-554\\datasets\\data.csv'  

    + Or use local paths!
    
        * Determine your **working directory**
        * Use a path **relative** to that
        

In [None]:
import os
os.getcwd()
#os.chdir("S:/Documents/repos/ST-554")
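
To see how a relative path resolves against the working directory, here is a small sketch that builds a throwaway folder, makes it the working directory, and reads a file by relative path. The folder layout, file contents, and variable names are assumptions made only for this example.

```python
import os
import tempfile
import pandas as pd

# Hypothetical setup: a throwaway folder containing datasets/data.csv
base_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(base_dir, "datasets"))
with open(os.path.join(base_dir, "datasets", "data.csv"), "w") as f:
    f.write("x,y\n1,2\n3,4\n")

original_dir = os.getcwd()
os.chdir(base_dir)  # make the throwaway folder the working directory
try:
    relative_path = os.path.join("datasets", "data.csv")
    # A relative path is resolved against the current working directory
    resolved_path = os.path.abspath(relative_path)
    path_data = pd.read_csv(relative_path)
finally:
    os.chdir(original_dir)  # restore the original working directory

print(resolved_path)
print(path_data)
```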

---

# Reading .csv Locally

- Nicely formatted `.csv` files can be read in with the `read_csv()` function from `pandas`

- `neuralgia.csv` file in a folder called `Datasets` in my **working directory**

In [None]:
neuralgia_data = pd.read_csv("Datasets/neuralgia.csv")
neuralgia_data.shape
neuralgia_data.head()

---

# Reading .csv From a URL

- Nicely formatted `.csv` files can be read in with the `read_csv()` function from `pandas`

- `scoresFull.csv` file at a **URL** given by 'https://www4.stat.ncsu.edu/~online/datasets/scoresFull.csv'

In [None]:
scores_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/scoresFull.csv")
scores_data.shape
scores_data.head()

---

# Reading Delimited Data

- To read other types of delimited data, we also use `read_csv()`!

- `chemical.txt` file (space delimiter) stored at "https://www4.stat.ncsu.edu/~online/datasets/chemical.txt"

In [None]:
chem_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/chemical.txt", sep=" ")
chem_data.info()

---

# Reading Delimited Data

- To read other types of delimited data, we also use `read_csv()`!

- `crabs.txt` file (tab delimiter) stored at "https://www4.stat.ncsu.edu/~online/datasets/crabs.txt"

In [None]:
crabs_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/crabs.txt", sep="\t")
crabs_data.info()

---

# Reading Delimited Data

- To read other types of delimited data, we also use `read_csv()`!

- `umps2012.txt` file (`>` delimiter) stored at "https://www4.stat.ncsu.edu/~online/datasets/umps2012.txt"

    + No column names in raw file
    

In [None]:
ump_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/umps2012.txt",
                      sep=">", 
                      header=None, 
                      names=["Year", "Month", "Day", "Home", "Away", "HPUmpire"])
ump_data.head()
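
The effect of `header=None` can be checked on a tiny in-memory sample. This is only a sketch: the two rows below are made-up placeholders in the same `>`-delimited layout, not values from `umps2012.txt`.

```python
import io
import pandas as pd

# Hypothetical rows with no header line, using '>' as the delimiter
raw_text = "2012>4>12>AAA>BBB>Umpire A\n2012>4>13>CCC>DDD>Umpire B\n"
col_names = ["Year", "Month", "Day", "Home", "Away", "HPUmpire"]

# With neither header nor names given, the first data row is used as column names
wrong = pd.read_csv(io.StringIO(raw_text), sep=">")

# With header=None and names=, every row is kept as data
right = pd.read_csv(io.StringIO(raw_text), sep=">", header=None, names=col_names)

print(wrong.shape)
print(right.shape)
```

Comparing the two shapes shows that the first read loses a row of data to the header.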

---

# Reading Excel Data

- Use the `ExcelFile()` function from `pandas`

- `censusEd.xlsx` file located at "https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx"

In [None]:
ed_data = pd.ExcelFile("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx")
ed_data

---

# Reading Excel Data

- Use the `ExcelFile()` function from `pandas`

- The result is an `ExcelFile` object rather than a data frame, so different attributes and methods apply!

In [None]:
# ed_data.head() and ed_data.info() won't work on an ExcelFile object!
type(ed_data)
ed_data.sheet_names
ed_data.parse('EDU01A').head()

---

# Reading Excel Data

- Or use the `read_excel()` function from `pandas`

- `censusEd.xlsx` file located at "https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx"

In [None]:
ed_data = pd.read_excel("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx", sheet_name=0)  # or sheet_name="EDU01A"
ed_data.head()
ed_data.info()

---

# Reading Excel Data

- Or use the `read_excel()` function from `pandas`

- Read all sheets with `sheet_name = None`

In [None]:
ed_data = pd.read_excel("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx", sheet_name = None)
type(ed_data)
ed_data.keys()
ed_data.get("EDU01A").info()

---

# Reading JSON Data

- JSON data has a structure similar to a dictionary

    + Key-value pairs

```
[
  {
    "name": "Barry Sanders",
    "games": 153,
    "position": "RB"
  },
  {
    "name": "Joe Montana",
    "games": 192,
    "position": "QB"
  }
]
```

- `read_json()` function will work!
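
A minimal sketch of `read_json()` on records like these, read from an in-memory string instead of a file or URL. The variable names `json_text` and `players` are chosen just for this example.

```python
import io
import pandas as pd

# A JSON array of objects; each object becomes one row of the data frame
json_text = """
[
  {"name": "Barry Sanders", "games": 153, "position": "RB"},
  {"name": "Joe Montana", "games": 192, "position": "QB"}
]
"""

players = pd.read_json(io.StringIO(json_text))
print(players)
```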

---

# Reading JSON Data

- `read_json()` to read in data located at "https://api.exchangerate-api.com/v4/latest/USD"

In [None]:
url = "https://api.exchangerate-api.com/v4/latest/USD"
usd_data = pd.read_json(url)
usd_data.head()
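
API responses are often nested. The sketch below uses a made-up payload, assuming a top-level `base` value and a nested `rates` object; the real response may contain other fields. It shows how `read_json()` lays such data out.

```python
import io
import pandas as pd

# Hypothetical payload: made-up values in a nested layout with a "rates" object
json_text = '{"base": "USD", "rates": {"USD": 1.0, "EUR": 0.5, "GBP": 0.25}}'

fx_data = pd.read_json(io.StringIO(json_text))

# The keys of the nested object become the row index,
# and top-level scalar values are repeated on every row
print(fx_data)
print(fx_data.loc["EUR", "rates"])
```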

---

# Writing Out Data

- Can write out data using the `to_csv()` method

In [None]:
usd_data.to_csv('usd_data.csv', index = False)
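
To check what `index = False` does, here is a small round-trip sketch that writes a made-up data frame to a temporary file and reads it back. The data frame, file name, and variable names are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

# Hypothetical small data frame to write out
small_data = pd.DataFrame({"currency": ["USD", "EUR"], "rate": [1.0, 0.5]})

out_path = os.path.join(tempfile.mkdtemp(), "small_data.csv")

# index=False leaves the row labels (0, 1, ...) out of the file
small_data.to_csv(out_path, index=False)

# Inspect the header line that was actually written
with open(out_path) as f:
    first_line = f.readline().strip()

# Reading the file back recovers the same columns
round_trip = pd.read_csv(out_path)
print(first_line)
print(round_trip)
```

Without `index=False`, the file would gain an extra unnamed first column holding the row labels.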

---

# To JupyterLab!  

- Read in some SAS data

- Read in data from an API

---

# Recap

- Read data in with `pandas`  

    + `read_csv()` for delimited data
    
    + `ExcelFile()` or `read_excel()` for Excel data
    
    + `read_json()` for JSON
    
- Write to a `.csv` with `.to_csv()`

