# Unit 5 Lecture 1 - Loading Data

ESI4628: Decision Support Systems for Industrial Engineers<br>
University of Central Florida
Dr. Ivan Garibay, Ramya Akula, Mostafa Saeidi, Madeline Schiappa, Brett Belcher, Jonathan A. 
https://github.com/igaribay/DSSwithPython/blob/master/DSS-Week05/Notebook/DSS-Unit05-Lecture01.ipynb

## Notebook Learning Objectives
After studying this notebook students should be able to:
- Load structured and unstructured data into Python
- Load data from files or URLs in CSV, Text, and JSON formats into Pandas Series and DataFrame
- Write a DataFrame to a file in CSV, Text, and JSON formats

# Overview

The very first step in any data science project is _data ingestion_. Data can be structured (SQL tables, CSV files, Excel files) or unstructured. A widely used file format for unstructured data is JSON (JavaScript Object Notation). It is also important to be able to "write" your results into these formats so that other people can reproduce and validate them.

# Reading data in different formats:

- ```read_csv``` loads data from a file or URL, using a comma as the default delimiter.

- ```read_table``` loads data from a file or URL, using a tab ('\t') as the default delimiter.
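
As a minimal sketch of the difference between the two functions, the snippet below parses the same small table twice, once comma-separated and once tab-separated. The data and variable names are made up for illustration, and `io.StringIO` stands in for a file or URL.

```python
import io
import pandas as pd

# Hypothetical data: the same table with two different delimiters
comma_text = "id,age\n1,24\n2,53\n"
tab_text = "id\tage\n1\t24\n2\t53\n"

from_csv = pd.read_csv(io.StringIO(comma_text))      # comma is the default
from_table = pd.read_table(io.StringIO(tab_text))    # tab is the default

# read_csv can read tab-separated data once the separator is given explicitly
from_csv_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")

print(from_csv.equals(from_table))
print(from_table.equals(from_csv_tab))
```

Both functions accept the same `sep` argument, so the choice between them is mostly about which default is more convenient.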

In [219]:
import pandas as pd

## ```read_csv```
We use <code>read_csv</code> to create a pandas DataFrame from an external _Comma-Separated Values (CSV)_ formatted data file. In the example below, a CSV file called __housing_dataset.csv__ is loaded using this method:

In [220]:
csv_path = 'https://s3.amazonaws.com/dss-fall2018/housing_dataset.csv'
df = pd.read_csv(csv_path)
df.tail()

Unnamed: 0,SalePrice,LotFrontage,LotArea,OverallQual,MasVnrArea,YearBuilt,BsmtUnfSF,YearRemodAdd,TotalBsmtSF,BsmtFinSF1,1stFlrSF
1190,0.194556,0.140411,0.030929,0.555556,0.0,0.92029,0.407962,0.833333,0.155974,0.0,0.142038
1191,0.243161,0.219178,0.055505,0.555556,0.074375,0.768116,0.25214,0.633333,0.252373,0.139972,0.399036
1192,0.321622,0.15411,0.036187,0.666667,0.0,0.5,0.375428,0.933333,0.188543,0.048724,0.195961
1193,0.148903,0.160959,0.039342,0.444444,0.0,0.565217,0.0,0.766667,0.176432,0.008682,0.170721
1194,0.156367,0.184932,0.04037,0.444444,0.0,0.673913,0.058219,0.25,0.205565,0.147059,0.211565


One of the nice features of data-reading functions such as <code>read_csv</code> is _type inference_: we do not have to specify which columns are numeric, strings, etc.

In [221]:
df.dtypes

SalePrice       float64
LotFrontage     float64
LotArea         float64
OverallQual     float64
MasVnrArea      float64
YearBuilt       float64
BsmtUnfSF       float64
YearRemodAdd    float64
TotalBsmtSF     float64
BsmtFinSF1      float64
1stFlrSF        float64
dtype: object
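
Type inference is convenient, but it can occasionally guess something we do not want. The sketch below uses a made-up three-column table (not the housing data) to show the inferred types, and then overrides the inference for one column with the `dtype` argument so that a zip code keeps its leading zero.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk
csv_text = "zip_code,people,share\n01002,44,0.5\n10003,1,1.0\n"

# Default behavior: pandas infers a type for every column,
# so zip_code is read as an integer and loses its leading zero
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)

# Override the inference for one column to keep the text as written
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str})
print(as_text["zip_code"].tolist())
```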

We can also read a local CSV file. The path to the file is relative to the folder where this Notebook is stored. Let's read a local file stored at "../Data/DSS-Data01-Demographic_Statistics_By_Zip_Code.csv". This file was originally downloaded from https://catalog.data.gov/dataset?res_format=CSV and contains public demographic information from the City of New York by zip code.

In [222]:
csv_path = '../Data/DSS-Data01-Demographic_Statistics_By_Zip_Code.csv'
df2 = pd.read_csv(csv_path)
df2.tail()

Unnamed: 0,JURISDICTION NAME,COUNT PARTICIPANTS,COUNT FEMALE,PERCENT FEMALE,COUNT MALE,PERCENT MALE,COUNT GENDER UNKNOWN,PERCENT GENDER UNKNOWN,COUNT GENDER TOTAL,PERCENT GENDER TOTAL,...,COUNT CITIZEN STATUS TOTAL,PERCENT CITIZEN STATUS TOTAL,COUNT RECEIVES PUBLIC ASSISTANCE,PERCENT RECEIVES PUBLIC ASSISTANCE,COUNT NRECEIVES PUBLIC ASSISTANCE,PERCENT NRECEIVES PUBLIC ASSISTANCE,COUNT PUBLIC ASSISTANCE UNKNOWN,PERCENT PUBLIC ASSISTANCE UNKNOWN,COUNT PUBLIC ASSISTANCE TOTAL,PERCENT PUBLIC ASSISTANCE TOTAL
231,12788,83,39,0.47,44,0.53,0,0,83,100,...,83,100,35,0.42,48,0.58,0,0,83,100
232,12789,272,115,0.42,157,0.58,0,0,272,100,...,272,100,70,0.26,202,0.74,0,0,272,100
233,13731,17,2,0.12,15,0.88,0,0,17,100,...,17,100,7,0.41,10,0.59,0,0,17,100
234,16091,0,0,0.0,0,0.0,0,0,0,0,...,0,0,0,0.0,0,0.0,0,0,0,0
235,20459,0,0,0.0,0,0.0,0,0,0,0,...,0,0,0,0.0,0,0.0,0,0,0,0


Let's select a small subset of the data above to continue with our examples.

In [223]:
df3 = df2.loc[0:235,["JURISDICTION NAME","COUNT PARTICIPANTS", "COUNT FEMALE", "COUNT MALE"]]
df3.tail()

Unnamed: 0,JURISDICTION NAME,COUNT PARTICIPANTS,COUNT FEMALE,COUNT MALE
231,12788,83,39,44
232,12789,272,115,157
233,13731,17,2,15
234,16091,0,0,0
235,20459,0,0,0


# Writing DataFrame into a CSV File 
Using <code>.to_csv</code>, we can now write this new dataset into a CSV file as follows:

In [224]:
df3.to_csv('../Data/my_output_file.csv')

__Note:__ You should be able to access your file system and find the file _../Data/my_output_file.csv_ on your computer. Try opening it with Excel or a text editor.

Now we can load this CSV file and modify it while importing, for example by changing the index or using multiple index columns.

In [225]:
csv_path = '../Data/my_output_file.csv'
df4 = pd.read_csv(csv_path)
df4.tail()
#df4

Unnamed: 0.1,Unnamed: 0,JURISDICTION NAME,COUNT PARTICIPANTS,COUNT FEMALE,COUNT MALE
231,231,12788,83,39,44
232,232,12789,272,115,157
233,233,13731,17,2,15
234,234,16091,0,0,0
235,235,20459,0,0,0


The resulting DataFrame has a repeated column because <code>to_csv</code> wrote the index to the file first, and <code>read_csv</code> then added a new index on load. Let's now export the DataFrame without the index or column headers:

In [226]:
df3.to_csv('../Data/my_output_file2.csv', index=False, header=False)

In [227]:
csv_path = '../Data/my_output_file2.csv'
df5 = pd.read_csv(csv_path)
df5.head()
#df5

Unnamed: 0,10001,44,22,22.1
0,10002,35,19,16
1,10003,1,1,0
2,10004,0,0,0
3,10005,2,2,0
4,10006,6,2,4


Loading our newly created CSV file erroneously turns the first row of data into the header. We know this data does not have a header, so we can either let pandas assign default column headers or assign our own, as follows:

In [228]:
df5 = pd.read_csv(csv_path, header=None) # no header row in the file; pandas assigns default column labels
df5.head()

Unnamed: 0,0,1,2,3
0,10001,44,22,22
1,10002,35,19,16
2,10003,1,1,0
3,10004,0,0,0
4,10005,2,2,0


In [229]:
df5 = pd.read_csv(csv_path, names=['Zip Code', 'People', 'Female', 'Male'])
df5.head()

Unnamed: 0,Zip Code,People,Female,Male
0,10001,44,22,22
1,10002,35,19,16
2,10003,1,1,0
3,10004,0,0,0
4,10005,2,2,0
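
The repeated index column seen earlier can also be avoided at read time instead of at write time. The following is a minimal sketch, assuming a tiny made-up table and an in-memory buffer in place of a file on disk: the index is written as usual, and `index_col=0` tells `read_csv` to reuse that first column as the index.

```python
import io
import pandas as pd

# Hypothetical miniature of the zip-code table
small = pd.DataFrame({"Zip Code": [10001, 10002], "People": [44, 35]})

buffer = io.StringIO()          # in-memory stand-in for a file on disk
small.to_csv(buffer)            # the index is written as an unnamed first column
buffer.seek(0)                  # rewind so the buffer can be read from the start

# index_col=0 reuses that first column as the index instead of adding a new one
restored = pd.read_csv(buffer, index_col=0)
print(restored.columns.tolist())
print(restored.index.tolist())
```

Passing a list to `index_col` (for example `index_col=[0, 1]`) builds an index from several columns.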


## ```read_table```
We use <code>read_table</code> to create a pandas DataFrame from an external data file that, by default, is expected to be _Tab-Separated Values_ formatted. We can also specify the character (or characters) separating the data values in the file we want to load, for instance '|' or any other character or sequence used as a separator. In the example below we work with a TXT file: __SampleTextFile.txt__.

In [230]:
Text = pd.read_table('https://s3.amazonaws.com/dss-fall2018/SampleTextFile.txt')
Text


Unnamed: 0,"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus condimentum sagittis lacus, laoreet luctus ligula laoreet ut. Vestibulum ullamcorper accumsan velit vel vehicula. Proin tempor lacus arcu. Nunc at elit condimentum, semper nisi et, condimentum mi. In venenatis blandit nibh at sollicitudin. Vestibulum dapibus mauris at orci maximus pellentesque. Nullam id elementum ipsum. Suspendisse cursus lobortis viverra. Proin et erat at mauris tincidunt porttitor vitae ac dui."
0,"Donec vulputate lorem tortor, nec fermentum ni..."
1,"Nulla luctus sem sit amet nisi consequat, id o..."
2,Vestibulum ante ipsum primis in faucibus orci ...
3,"Etiam vitae accumsan augue. Ut urna orci, male..."
4,"Integer eu hendrerit diam, sed consectetur nun..."
5,Mauris nec metus vel dolor blandit faucibus et...
6,Quisque venenatis justo sit amet tortor condim...
7,"Phasellus fringilla luctus magna, a finibus ju..."
8,"Maecenas turpis enim, consectetur eget lectus ..."
9,Sed consequat mi at maximus faucibus. Pellente...


In [231]:
Text = pd.read_table('https://s3.amazonaws.com/dss-fall2018/SampleTextFile.txt')
#Text
Text.tail()

Unnamed: 0,"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus condimentum sagittis lacus, laoreet luctus ligula laoreet ut. Vestibulum ullamcorper accumsan velit vel vehicula. Proin tempor lacus arcu. Nunc at elit condimentum, semper nisi et, condimentum mi. In venenatis blandit nibh at sollicitudin. Vestibulum dapibus mauris at orci maximus pellentesque. Nullam id elementum ipsum. Suspendisse cursus lobortis viverra. Proin et erat at mauris tincidunt porttitor vitae ac dui."
158,Interdum et malesuada fames ac ante ipsum prim...
159,"Cras pellentesque enim a quam dapibus, sed tin..."
160,"Phasellus a nisl malesuada, pharetra dui sit a..."
161,Fusce tincidunt dictum tempor. Mauris nec tell...
162,Sed tristique auctor tellus id facilisis. Quis...


Since the text does not contain tabs, each entire line of text is loaded as a single data field. Also, this file does not contain a header row, so let's import it again, this time specifying the header "text":

In [232]:
Text = pd.read_table('https://s3.amazonaws.com/dss-fall2018/SampleTextFile.txt', names=['text'])
#Text
Text.tail()

Unnamed: 0,text
159,Interdum et malesuada fames ac ante ipsum prim...
160,"Cras pellentesque enim a quam dapibus, sed tin..."
161,"Phasellus a nisl malesuada, pharetra dui sit a..."
162,Fusce tincidunt dictum tempor. Mauris nec tell...
163,Sed tristique auctor tellus id facilisis. Quis...


For some __very large data sets__ it may not be possible (or convenient) to load all of the data into a DataFrame at once. In these cases, we can load just the first few rows using <code>nrows=</code>:

In [233]:
Text = pd.read_table('https://s3.amazonaws.com/dss-fall2018/SampleTextFile.txt', names=['text'], nrows=5)
Text

Unnamed: 0,text
0,"Lorem ipsum dolor sit amet, consectetur adipis..."
1,"Donec vulputate lorem tortor, nec fermentum ni..."
2,"Nulla luctus sem sit amet nisi consequat, id o..."
3,Vestibulum ante ipsum primis in faucibus orci ...
4,"Etiam vitae accumsan augue. Ut urna orci, male..."
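
When we need every row of a large file but cannot hold it all in memory, one option is to read it in pieces with the `chunksize` argument. This is a sketch only: the ten-line text below is a made-up stand-in for a large file, and the chunk size of 4 is arbitrary.

```python
import io
import pandas as pd

# Hypothetical ten-line text standing in for a very large file
lines = "\n".join("line {}".format(i) for i in range(10))

total_rows = 0
# With chunksize, read_table returns an iterator of DataFrames
for chunk in pd.read_table(io.StringIO(lines), names=["text"], chunksize=4):
    total_rows += len(chunk)    # process each piece, here just counting rows

print(total_rows)
```

Each `chunk` is an ordinary DataFrame, so any per-piece computation (counting, filtering, summing) can go inside the loop.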


# Dealing with Missing Data


When reading files in any format, it is important to account for missing data. pandas marks missing entries as NaN; by default, empty fields and markers such as NA or NULL in the file are read as NaN.

A simple way to check whether a ```DataFrame``` has any NaN values is the ```.isnull``` function, which returns True wherever a value is missing.

In [234]:
pd.isnull(Text) 

Unnamed: 0,text
0,False
1,False
2,False
3,False
4,False
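
The text file above has no missing values, so the result is all False. As an illustration with gaps, here is a small sketch on made-up order data (the column names and values are hypothetical) that counts the missing values per column and shows two common ways of handling them.

```python
import io
import pandas as pd

# Hypothetical CSV with empty fields in the "choice" and "price" columns
csv_text = "item,choice,price\nChips,,2.39\nIzze,Clementine,3.39\nBowl,,\n"
orders = pd.read_csv(io.StringIO(csv_text))

# True counts as 1, so summing the boolean table counts NaN per column
missing_per_column = orders.isnull().sum()
print(missing_per_column)

complete_rows = orders.dropna()                   # keep only rows with no NaN
filled = orders.fillna({"choice": "none"})        # replace NaN in one column
print(filled)
```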


# Reading and Writing Unstructured Data 
The most popular format for handling unstructured data is __JSON__ (JavaScript Object Notation).
JSON files can be used to store structured or unstructured data. In unstructured data, elements often do not all contain the same fields. The data can also be "nested" rather than tabular.
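
To make "nested" concrete, the sketch below flattens two made-up records whose fields differ. It assumes a pandas version that provides `pd.json_normalize` (1.0 or later); the record contents are hypothetical.

```python
import json
import pandas as pd

# Hypothetical records: the second has no zip and has an extra "phone" field
raw = '''[
  {"name": "Ana", "address": {"city": "Orlando", "zip": "32816"}},
  {"name": "Ben", "address": {"city": "Tampa"}, "phone": "555-0100"}
]'''
records = json.loads(raw)       # parse the JSON text into Python lists and dicts

# Nested keys become dotted column names; absent fields become NaN
flat = pd.json_normalize(records)
print(flat.columns.tolist())
print(flat)
```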

## Reading a simple JSON object

<code>.read_json</code> reads a JSON object from a file and returns a pandas Series or DataFrame (the default). For example, let's read the following JSON object from __"../Data/sample_load_json1.json"__:

{<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"fruit": {"0": "Apple"},<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"size": {"0": "Large"},<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"color": {"0": "Red"}<br>
}<br>

In [238]:
df6 = pd.read_json("../Data/sample_load_json1.json") #load JSON from file to DataFrame df6
df6

Unnamed: 0,color,fruit,size
0,Red,Apple,Large


## Creating a simple JSON object from a DataFrame
The <code>.to_json</code> method creates a JSON object from a DataFrame as follows:

In [239]:
my_json_object = df6.to_json() # variable my_json_object
print(my_json_object)

{"color":{"0":"Red"},"fruit":{"0":"Apple"},"size":{"0":"Large"}}


We can also directly write the JSON object into a file:

In [240]:
df6.to_json("../Data/sample_write_json2.json") #writing JSON to a file
df7=pd.read_json("../Data/sample_write_json2.json") #loading from file and displaying
df7

Unnamed: 0,color,fruit,size
0,Red,Apple,Large


# Exercises

__1.__ Read the entire dataset from http://bit.ly/chiporders into a DataFrame. How many rows does this dataset have? Display only the first rows of this data. How many "NaN" values are in the column "choice_description"?

__Solution:__ First point your browser to http://bit.ly/chiporders to find out what format the data is in. In this case it is text separated by tabs, so we use <code>.read_table</code>:

In [241]:
pd.read_table('http://bit.ly/chiporders') # reading the dataset

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
5,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",$10.98
6,3,1,Side of Chips,,$1.69
7,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",$11.75
8,4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...",$9.25
9,5,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...",$9.25


In [242]:
order = pd.read_table('http://bit.ly/chiporders') #before manipulating the data, 
                                                  #we store it into a dataframe "order"
order.head()  # use .head() to return only the first rows

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


In [243]:
order.isnull().sum() # .isnull() returns True for each missing value; .sum() counts them per column

order_id                 0
quantity                 0
item_name                0
choice_description    1246
item_price               0
dtype: int64

__2.__ Read the entire dataset from http://bit.ly/movieusers into a DataFrame. Load the data into the following columns: 'id','age','gender','job','zip_code'. What is the average age of moviegoers? How many users in this dataset are female?

__Solution:__ First point your browser to http://bit.ly/movieusers to find out what format the data is in. In this case it is text separated by "|", so we use <code>.read_table</code> with the argument <code>sep='|'</code>. If we use the default tab separator, this is what we get:

In [244]:
pd.read_table('http://bit.ly/movieusers')

Unnamed: 0,1|24|M|technician|85711
0,2|53|F|other|94043
1,3|23|M|writer|32067
2,4|24|M|technician|43537
3,5|33|F|other|15213
4,6|42|M|executive|98101
5,7|57|M|administrator|91344
6,8|36|M|administrator|05201
7,9|29|M|student|01002
8,10|53|M|lawyer|90703
9,11|39|F|other|30329


This is not what we want: each row has been read into a single column. Using the correct separator we obtain:

In [245]:
pd.read_table('http://bit.ly/movieusers', sep='|')

Unnamed: 0,1,24,M,technician,85711
0,2,53,F,other,94043
1,3,23,M,writer,32067
2,4,24,M,technician,43537
3,5,33,F,other,15213
4,6,42,M,executive,98101
5,7,57,M,administrator,91344
6,8,36,M,administrator,05201
7,9,29,M,student,01002
8,10,53,M,lawyer,90703
9,11,39,F,other,30329


The result looks better than before: each field is now in its own column.

The other issue is that the first row is not a header row, so we add the headers as follows:

In [246]:
header_name = ['id','age','gender','job','zip_code']

movie_df = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=header_name)
movie_df

Unnamed: 0,id,age,gender,job,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
5,6,42,M,executive,98101
6,7,57,M,administrator,91344
7,8,36,M,administrator,05201
8,9,29,M,student,01002
9,10,53,M,lawyer,90703


In [247]:
movie_df['age'].mean() # average age of movie goers

34.05196182396607

In [248]:
Female_count = 0                 # counter for female moviegoers
for x in movie_df['gender']:     # visit each value in the "gender" column
    if x == 'F':
        Female_count += 1        # increase the counter when the value is 'F'
Female_count                     # final count

273
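
The loop above works, but the same count can be written without a loop. The sketch below uses a made-up four-row table in place of the movie users data: comparing a column with 'F' gives a column of True/False values, and summing it counts the True entries.

```python
import pandas as pd

# Hypothetical four-row stand-in for the movie users table
users = pd.DataFrame({"age": [24, 53, 23, 33],
                      "gender": ["M", "F", "M", "F"]})

# True counts as 1, so the sum is the number of rows where gender is 'F'
female_count = int((users["gender"] == "F").sum())

# value_counts tallies every distinct value in one call
counts = users["gender"].value_counts()

mean_age = users["age"].mean()
print(female_count, mean_age)
print(counts)
```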

__3.__ Go to the Data.gov catalog of CSV files at https://catalog.data.gov/dataset?res_format=CSV . Pick a dataset in CSV format. Download it and import it into a DataFrame df1. How many rows did you import? Display the first rows of the dataset. Now take df1 and save it to a local CSV file. Load the file you just saved into a DataFrame df2. Compare df1 and df2. Are they identical? Why or why not?

__4.__ Go to the Data.gov catalog of JSON files at https://catalog.data.gov/dataset?res_format=JSON . Pick a dataset in JSON format. Download it and import a subset of the JSON object into a DataFrame df1. Feel free to pick any subset: JSON objects can be complex, and for this exercise you just want to keep it simple. How many rows did you import? Display the first rows of the dataset. Now take df1 and save it to a local JSON file. Load the file you just saved into a DataFrame df2. Compare df1 and df2. Are they identical? Why or why not?

# Homework (not graded)
Please complete all the exercises in this Notebook. Some exercises will be solved in class, but you should complete all the remaining exercises at the end of each Notebook for every class. If you cannot solve an exercise, please contact the class teaching assistant for help immediately.

# References

1. "Python and JSON: Working with large datasets using Pandas" https://www.dataquest.io/blog/python-json-tutorial/
2. Loading JSON, https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html

_Last updated on 9.20.18 2:24am<br>
(C) 2018 Complex Adaptive Systems Laboratory all rights reserved._