<a href="https://colab.research.google.com/github/jbpost2/ST-554-Big-Data-with-Python/blob/main/01_Programming_in_python/16-Pandas_For_Reading_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas for Reading Raw Data
Justin Post

- The `pandas` library has functionality for reading **delimited data**, Excel data, SAS data, JSON data, and more!
    
- Remember we'll need to import the `pandas` library in order to use it!

In [4]:
import pandas as pd

Note: These types of webpages are built from Jupyter notebooks (`.ipynb` files). You can access your own versions of them by [clicking here](https://colab.research.google.com/github/jbpost2/ST-554-Big-Data-with-Python/blob/main/01_Programming_in_python/16-Pandas_For_Reading_Data.ipynb). **It is highly recommended that you go through and run the notebooks yourself, modifying and rerunning things where you'd like!**

---

## Data Formats

Raw data comes in many different formats. Understanding the raw data format is essential for reading that data into `python`. Some raw data types include:

- 'Delimited' data: data values separated by a character such as [','](https://www4.stat.ncsu.edu/~online/datasets/scores.csv), ['>'](https://www4.stat.ncsu.edu/~online/datasets/umps2012.txt), or `' '`  

- [Fixed field](https://www4.stat.ncsu.edu/~online/datasets/cigarettes.txt) data  

- [Excel](https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx) data  

- From other statistical software, Ex: [SPSS formatted](https://www4.stat.ncsu.edu/~online/datasets/bodyFat.sav) data or [SAS data sets](https://www4.stat.ncsu.edu/~online/datasets/smoke2003.sas7bdat)  

- From an Application Programming Interface (API) (often returned as a `JSON` file - key/value pairs, similar to a dictionary)

- From a database  

---

## Delimited Data

Let's start with delimited data.

- One common format for **raw** data is delimited data

    + Data that has a character or characters that separates the data values
    
    + Character(s) is (are) called **delimiter(s)**

- With `pandas`, the `read_csv()` function can read in this kind of data (although `csv` stands for 'comma separated value', this function is used for reading most delimited data via `pandas`)

    + If the raw data is well-formatted, we just need to tell `python` where to find it!
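
To make the idea of a delimiter concrete, here is a minimal sketch that reads the same two records from two in-memory strings, one comma-delimited and one using `|`. The data, the variable names, and the `|` delimiter are made up for illustration; `io.StringIO` simply lets `read_csv()` treat a string as if it were a file.

```python
import io
import pandas as pd

# Hypothetical raw text: the same records written with two different delimiters
comma_text = "name,score\nAda,90\nGrace,85\n"
pipe_text = "name|score\nAda|90\nGrace|85\n"

# The comma is the default delimiter, so no extra argument is needed
comma_df = pd.read_csv(io.StringIO(comma_text))

# Any other delimiter is passed via sep=
pipe_df = pd.read_csv(io.StringIO(pipe_text), sep="|")

print(comma_df)
print(comma_df.equals(pipe_df))  # both reads produce the same DataFrame
```

Only the `sep=` argument changes between the two calls; the resulting `DataFrame` is the same either way.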

---

## Locating a File

- How does python locate the file?  
- When you are not in Colab (for example, in a local python session)
  + You can give the file's *full path name*  
    * ex: 'S:/Documents/repos/ST-554/datasets/data.csv'  
    * ex: 'S:\\Documents\\repos\\ST-554\\datasets\\data.csv'  
  + Or use relative paths!
    * Determine your **working directory**
    * Use a path **relative** to that
    * If your working directory is 'S:/Documents/repos/ST-554' you can get to 'data.csv' via 'datasets/data.csv'
- The `os` module gives you access to functions for finding and setting your working directory

Using a cloud-based platform complicates things a bit
- In colab you can
  + [Mount your google drive](https://stackoverflow.com/questions/48376580/how-to-read-data-in-google-colab-from-my-google-drive)
  + Read files from URLs
  + Upload files via the menu on the left (folder icon, then upload a file via the icons there)

In [2]:
import os
#getcwd() stands for get current working directory
os.getcwd() #shows the directory you can get to via the folder icon on the left
#chdir() stands for change current directory
#os.chdir("S:/Documents/repos/ST-554") #won't work in colab but would work on a local python session

'/content'

This `/content` refers to the main folder on the left hand side of `Colab`!
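
As a rough sketch of how a relative path is resolved against the working directory, the snippet below builds a path to a hypothetical `datasets/data.csv` (this file does not need to exist) and asks `os` where that path would point and whether anything is there.

```python
import os

# Current working directory ('/content' in a fresh Colab session)
cwd = os.getcwd()

# Hypothetical relative path; nothing is read, we only resolve the name
relative_path = os.path.join("datasets", "data.csv")

# abspath() resolves the relative path against the working directory
absolute_path = os.path.abspath(relative_path)

print(cwd)
print(absolute_path)

# Checking that a file exists before reading it can save a confusing error
print(os.path.exists(absolute_path))
```

If the last line prints `False`, either the file has not been uploaded yet or the working directory is not what you expected.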

---

## Reading Files 'Locally' in Colab

- Nicely formatted `.csv` files can be read in with the `read_csv()` function from `pandas`

- `neuralgia.csv` has been loaded into the folder on colab in my session. Therefore, it exists in my **working directory**. This won't be the case for you unless you upload the data during your session! You can click on the folder icon on the left, then click the upload button to upload <a href = "https://www4.stat.ncsu.edu/online/datasets/neuralgia.csv" target = "_blank"> this data set</a>.

In [5]:
neuralgia_data = pd.read_csv("neuralgia.csv") #neuralgia.csv file was uploaded to colab for my session
neuralgia_data.head() #this code block won't work unless you upload the data in your session

| | Treatment | Sex | Age | Duration | Pain |
|---|---|---|---|---|---|
| 0 | P | F | 68 | 1 | No |
| 1 | B | M | 74 | 16 | No |
| 2 | P | F | 67 | 30 | No |
| 3 | P | M | 66 | 26 | Yes |
| 4 | B | F | 67 | 28 | No |


In [6]:
neuralgia_data.shape

(60, 5)

---

## Reading From a URL

- Nicely formatted `.csv` files can be read in with the `read_csv()` function from `pandas`

- `scoresFull.csv` file at a **URL** given by 'https://www4.stat.ncsu.edu/~online/datasets/scoresFull.csv'

In [7]:
scores_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/scoresFull.csv")
scores_data.head()

Unnamed: 0,week,date,day,season,awayTeam,AQ1,AQ2,AQ3,AQ4,AOT,...,homeFumLost,homeNumPen,homePenYds,home3rdConv,home3rdAtt,home4thConv,home4thAtt,homeTOP,HminusAScore,homeSpread
0,1,5-Sep,Thu,2002,San Francisco 49ers,3,0,7,6,-1,...,0,10,80,4,8,0,1,32.47,-3,-4.0
1,1,8-Sep,Sun,2002,Minnesota Vikings,3,17,0,3,-1,...,1,4,33,2,6,0,0,28.48,4,4.5
2,1,8-Sep,Sun,2002,New Orleans Saints,6,7,7,0,6,...,0,8,85,1,6,0,1,31.48,-6,6.0
3,1,8-Sep,Sun,2002,New York Jets,0,17,3,11,6,...,1,10,82,4,8,2,2,39.13,-6,-3.0
4,1,8-Sep,Sun,2002,Arizona Cardinals,10,3,3,7,-1,...,0,7,56,6,10,1,2,34.4,8,6.0


In [8]:
scores_data.shape

(3471, 82)

- Oddly, to read other types of delimited data, we also use `read_csv()`!
  + Specify the `sep =` argument

- `chemical.txt` file (space delimiter) stored at "https://www4.stat.ncsu.edu/~online/datasets/chemical.txt"

In [11]:
chem_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/chemical.txt", sep=" ")
chem_data.head()

| | temp | conc | time | percent |
|---|---|---|---|---|
| 0 | -1.0 | -1.0 | -1.0 | 45.9 |
| 1 | 1.0 | -1.0 | -1.0 | 60.6 |
| 2 | -1.0 | 1.0 | -1.0 | 57.5 |
| 3 | 1.0 | 1.0 | -1.0 | 58.6 |
| 4 | -1.0 | -1.0 | 1.0 | 53.3 |


- `crabs.txt` file (tab delimiter) stored at "https://www4.stat.ncsu.edu/~online/datasets/crabs.txt"
  + Tab is `\t`

In [12]:
crabs_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/crabs.txt", sep="\t")
crabs_data.head()

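| | color | spine | width | satell | weight | y |
|---|---|---|---|---|---|---|
| 0 | 3 | 3 | 28.3 | 8 | 3050 | 1 |
| 1 | 4 | 3 | 22.5 | 0 | 1550 | 0 |
| 2 | 2 | 1 | 26.0 | 9 | 2300 | 1 |
| 3 | 4 | 3 | 24.8 | 0 | 2100 | 0 |
| 4 | 4 | 3 | 26.0 | 4 | 2600 | 1 |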


- `umps2012.txt` file (`>` delimiter) stored at "https://www4.stat.ncsu.edu/~online/datasets/umps2012.txt"
  + No column names in raw file
  + Can specify `header = None` and give column names when reading (via `names = [list of names]`)
    

In [13]:
ump_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/umps2012.txt",
                      sep=">",
                      header=None,
                      names=["Year", "Month", "Day", "Home", "Away", "HPUmpire"])
ump_data.head()

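| | Year | Month | Day | Home | Away | HPUmpire |
|---|---|---|---|---|---|---|
| 0 | 2012 | 4 | 12 | MIN | LAA | D.J. Reyburn |
| 1 | 2012 | 4 | 12 | SD | ARI | Marty Foster |
| 2 | 2012 | 4 | 12 | WSH | CIN | Mike Everitt |
| 3 | 2012 | 4 | 12 | PHI | MIA | Jeff Nelson |
| 4 | 2012 | 4 | 12 | CHC | MIL | Fieldin Culbreth |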
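
To see why `header = None` matters, the sketch below reads two lines shaped like the first rows of the output above from an in-memory string (an assumption about the raw layout, used so the example runs without a network connection). Without `header = None`, `read_csv()` treats the first record as the column names and silently drops it from the data.

```python
import io
import pandas as pd

# Two '>'-delimited lines modeled on the rows shown above (no header row)
raw = "2012>4>12>MIN>LAA>D.J. Reyburn\n2012>4>12>SD>ARI>Marty Foster\n"
cols = ["Year", "Month", "Day", "Home", "Away", "HPUmpire"]

# Default behavior: the first line is consumed as the header
wrong = pd.read_csv(io.StringIO(raw), sep=">")

# header=None keeps every line as data; names= supplies the column names
right = pd.read_csv(io.StringIO(raw), sep=">", header=None, names=cols)

print(list(wrong.columns))  # the first record ends up as column names
print(wrong.shape)          # one row of data is left
print(right.shape)          # both rows are kept
```

Comparing the two shapes is a quick check that no record was swallowed by the header.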


---

## Reading Excel Data

- Use the `ExcelFile()` class from `pandas`

- `censusEd.xlsx` file located at "https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx"

In [16]:
ed_data = pd.ExcelFile("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx")
ed_data

<pandas.io.excel._base.ExcelFile at 0x7d6ef1627880>

- Unfortunately, this object is not a `DataFrame`, so it has different attributes and methods!

In [17]:
#ed_data.head(), ed_data.info() won't work!
type(ed_data)

In [18]:
ed_data.sheet_names

['EDU01A',
 'EDU01B',
 'EDU01C',
 'EDU01D',
 'EDU01E',
 'EDU01F',
 'EDU01G',
 'EDU01H',
 'EDU01I',
 'EDU01J']

- Use the `.parse()` method with a sheet name to obtain a usual `DataFrame`

In [19]:
ed_data.parse('EDU01A').head()

  warn("""Cannot parse header or footer so it will be ignored""")


Unnamed: 0,Area_name,STCOU,EDU010187F,EDU010187D,EDU010187N1,EDU010187N2,EDU010188F,EDU010188D,EDU010188N1,EDU010188N2,...,EDU010194N1,EDU010194N2,EDU010195F,EDU010195D,EDU010195N1,EDU010195N2,EDU010196F,EDU010196D,EDU010196N1,EDU010196N2
0,UNITED STATES,0,0,40024299,0,0,0,39967624,0,0,...,0,0,0,43993459,0,0,0,44715737,0,0
1,ALABAMA,1000,0,733735,0,0,0,728234,0,0,...,0,0,0,727989,0,0,0,736825,0,0
2,"Autauga, AL",1001,0,6829,0,0,0,6900,0,0,...,0,0,0,7568,0,0,0,7834,0,0
3,"Baldwin, AL",1003,0,16417,0,0,0,16465,0,0,...,0,0,0,19961,0,0,0,20699,0,0
4,"Barbour, AL",1005,0,5071,0,0,0,5098,0,0,...,0,0,0,5017,0,0,0,5053,0,0


- Alternatively, use the `read_excel()` function from `pandas`

- This reads things into a standard `DataFrame`, but you have to specify a sheet to read in (or it defaults to the 1st)

- `censusEd.xlsx` file located at "https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx"

In [20]:
ed_data = pd.read_excel("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx",
                        sheet_name = 0) #or "EDU01A"
ed_data.head()

  warn("""Cannot parse header or footer so it will be ignored""")


Unnamed: 0,Area_name,STCOU,EDU010187F,EDU010187D,EDU010187N1,EDU010187N2,EDU010188F,EDU010188D,EDU010188N1,EDU010188N2,...,EDU010194N1,EDU010194N2,EDU010195F,EDU010195D,EDU010195N1,EDU010195N2,EDU010196F,EDU010196D,EDU010196N1,EDU010196N2
0,UNITED STATES,0,0,40024299,0,0,0,39967624,0,0,...,0,0,0,43993459,0,0,0,44715737,0,0
1,ALABAMA,1000,0,733735,0,0,0,728234,0,0,...,0,0,0,727989,0,0,0,736825,0,0
2,"Autauga, AL",1001,0,6829,0,0,0,6900,0,0,...,0,0,0,7568,0,0,0,7834,0,0
3,"Baldwin, AL",1003,0,16417,0,0,0,16465,0,0,...,0,0,0,19961,0,0,0,20699,0,0
4,"Barbour, AL",1005,0,5071,0,0,0,5098,0,0,...,0,0,0,5017,0,0,0,5053,0,0


- You can read **all sheets** with `sheet_name = None`
- This gets read into a dictionary!
  + Keys are the sheet name
  + Values are the `DataFrame`s from each sheet

In [22]:
ed_data = pd.read_excel("https://www4.stat.ncsu.edu/~online/datasets/censusEd.xlsx",
                        sheet_name = None)
type(ed_data)

  warn("""Cannot parse header or footer so it will be ignored""")


dict

In [23]:
ed_data.keys()

dict_keys(['EDU01A', 'EDU01B', 'EDU01C', 'EDU01D', 'EDU01E', 'EDU01F', 'EDU01G', 'EDU01H', 'EDU01I', 'EDU01J'])

In [25]:
ed_data.get("EDU01A").head() #get one DataFrame using its key!

Unnamed: 0,Area_name,STCOU,EDU010187F,EDU010187D,EDU010187N1,EDU010187N2,EDU010188F,EDU010188D,EDU010188N1,EDU010188N2,...,EDU010194N1,EDU010194N2,EDU010195F,EDU010195D,EDU010195N1,EDU010195N2,EDU010196F,EDU010196D,EDU010196N1,EDU010196N2
0,UNITED STATES,0,0,40024299,0,0,0,39967624,0,0,...,0,0,0,43993459,0,0,0,44715737,0,0
1,ALABAMA,1000,0,733735,0,0,0,728234,0,0,...,0,0,0,727989,0,0,0,736825,0,0
2,"Autauga, AL",1001,0,6829,0,0,0,6900,0,0,...,0,0,0,7568,0,0,0,7834,0,0
3,"Baldwin, AL",1003,0,16417,0,0,0,16465,0,0,...,0,0,0,19961,0,0,0,20699,0,0
4,"Barbour, AL",1005,0,5071,0,0,0,5098,0,0,...,0,0,0,5017,0,0,0,5053,0,0
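
Because `sheet_name = None` returns an ordinary dictionary, the usual dictionary tools apply. The sketch below uses a small hand-built dictionary of `DataFrame`s as a stand-in for the result of `read_excel()` (the sheet names and values are hypothetical) so that it runs without downloading the Excel file.

```python
import pandas as pd

# Hand-built stand-in for the dict returned by read_excel(..., sheet_name=None)
sheets = {
    "sheet_a": pd.DataFrame({"area": ["X", "Y"], "value": [1, 2]}),
    "sheet_b": pd.DataFrame({"area": ["X", "Y", "Z"], "value": [3, 4, 5]}),
}

# Loop over sheet names (keys) and DataFrames (values) together
for name, df in sheets.items():
    print(name, df.shape)

# A dictionary comprehension collects one summary per sheet
shapes = {name: df.shape for name, df in sheets.items()}
print(shapes)

# Square brackets work like .get() but raise a KeyError for a missing sheet
print(sheets["sheet_a"].head())
```

The same loop would work on the real `ed_data` dictionary, with the `EDU01A` through `EDU01J` sheet names as keys.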


---

## Reading JSON Data

- JSON data has a structure similar to a dictionary

    + Key-value pairs

```
[
  {
    "name": "Barry Sanders",
    "games": 153,
    "position": "RB"
  },
  {
    "name": "Joe Montana",
    "games": 192,
    "position": "QB"
  }
]
```
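
The link between JSON and dictionaries can be sketched with the standard `json` module. This example assumes the two player records are written as a valid JSON array (square brackets around the records, commas between the key-value pairs); `json.loads()` then returns a list of python dictionaries, which `pd.DataFrame()` turns into rows.

```python
import json
import pandas as pd

# The two player records, written as a valid JSON array
json_text = """
[
  {"name": "Barry Sanders", "games": 153, "position": "RB"},
  {"name": "Joe Montana", "games": 192, "position": "QB"}
]
"""

# json.loads() parses the text into a list of dictionaries
records = json.loads(json_text)
print(type(records), type(records[0]))
print(records[0]["name"])

# Each dictionary becomes a row; the keys become the column names
players = pd.DataFrame(records)
print(players)
```

This is only one way to go from JSON text to a `DataFrame`; `read_json()` handles the parsing and conversion in a single call.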

- `read_json()` function from `pandas` will work!
- Read in data from URL: "https://api.exchangerate-api.com/v4/latest/USD"

In [26]:
url = "https://api.exchangerate-api.com/v4/latest/USD"
usd_data = pd.read_json(url)
usd_data.head()

Unnamed: 0,provider,WARNING_UPGRADE_TO_V6,terms,base,date,time_last_updated,rates
USD,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,USD,2025-01-03,1735862402,1.0
AED,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,USD,2025-01-03,1735862402,3.67
AFN,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,USD,2025-01-03,1735862402,70.58
ALL,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,USD,2025-01-03,1735862402,94.12
AMD,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,USD,2025-01-03,1735862402,396.72


Every API is different, so how you request data and the structure of what is returned depend on the website you use. APIs are an important way we obtain data these days!
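
In the output above, the `provider`, `base`, and `date` columns repeat the same value on every row because the response mixes single values with a nested `rates` dictionary. As a hedged sketch of one way to tidy that up, the code below uses a hand-written dictionary shaped like part of that response (values copied from the output shown, not fetched live) and keeps just the rates.

```python
import pandas as pd

# Hand-written stand-in for part of the parsed API response (not a live request)
response = {
    "base": "USD",
    "date": "2025-01-03",
    "rates": {"USD": 1.0, "AED": 3.67, "AFN": 70.58, "ALL": 94.12, "AMD": 396.72},
}

# The nested dictionary becomes a Series: keys -> index, values -> data
rates = (
    pd.Series(response["rates"], name="rate")
    .rename_axis("currency")   # name the index so it becomes a labeled column
    .reset_index()             # move the currency codes into a regular column
)

# Single values can be attached as constant columns if they are still needed
rates["base"] = response["base"]
print(rates)
```

With a live response you would typically obtain the dictionary from the API first (for example with a request library) and then apply the same steps.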

---

# Writing Out a `DataFrame`

- We can write out data using the `.to_csv()` method

In [27]:
usd_data.to_csv('usd_data.csv', index = False) #this goes to the folder on the left in colab!
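
One thing to watch with `index = False`: when the row labels carry information, as the currency codes do in `usd_data`, they are not written to the file. The sketch below uses a small stand-in `DataFrame` (values taken from the output above) and writes to strings instead of files so nothing is left on disk; calling `.to_csv()` without a file name returns the CSV text.

```python
import io
import pandas as pd

# Small stand-in for usd_data: the currency codes live in the index
rates = pd.DataFrame({"rates": [1.0, 3.67, 70.58]}, index=["USD", "AED", "AFN"])

# index=False drops the row labels, so the currency codes are lost
without_index = rates.to_csv(index=False)
print(without_index)

# Keeping the index (the default) and naming it preserves the codes
with_index = rates.to_csv(index_label="currency")
print(with_index)

# Reading back with index_col restores the labels as the index
restored = pd.read_csv(io.StringIO(with_index), index_col="currency")
print(restored)
```

Whether to keep the index is a judgment call: drop it when it is just 0, 1, 2, ..., and keep it when it identifies the rows.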

---

# Quick Video

This video shows some examples of reading in some SAS data and data from an API.

Remember to pop the video out into the full player.

The notebook written in the video is [available here](https://colab.research.google.com/github/jbpost2/ST-554-Big-Data-with-Python/blob/main/01_Programming_in_python/Learning_Python.ipynb).

In [None]:
from IPython.display import IFrame
IFrame(src="https://ncsu.hosted.panopto.com/Panopto/Pages/Embed.aspx?id=30db871f-1750-4e5d-9de8-b0ff000c225c&autoplay=false&offerviewer=true&showtitle=true&showbrand=true&captions=false&interactivity=all", height="405", width="720")

---

# Recap

- Read data in with `pandas`  

    + `read_csv()` for delimited data
    
    + `ExcelFile()` or `read_excel()` for Excel data
    
    + `read_json()` for JSON
    
- Write to a `.csv` with `.to_csv()`

If you are on the course website, use the table of contents on the left or the arrows at the bottom of this page to navigate to the next learning material!

If you are on Google Colab, head back to our course website for [our next lesson](https://jbpost2.github.io/ST-554-Big-Data-with-Python/01_Programming_in_python/17-Numerical_Summaries.html)!