## Overview of Tools and Reading Data

This notebook provides an overview of the tools we will use in this course as we start reading in data using the CSGenome APIs. 


### Python:
* tutorial: https://www.csc2.ncsu.edu/faculty/healey/msa-17/python/index.html  



In [None]:
# This is a code cell
# This is how we can comment our code directly in the cell
print("Hours in a week", 24 * 7)

Hours in a week 168


### Diving into Jupyter Notebooks:

* We will write code in code cells and write comments and explanations like this in markdown cells (this cell is a markdown cell)
    * Click on the pen for this cell to go into edit mode and see how "###" and "*" are used to format the text to create a header and bullet points.
    * Markdown Reference: https://help.github.com/articles/basic-writing-and-formatting-syntax/ 
  

### Let's practice using Jupyter Notebooks!

Step 1: Insert a new code cell below this cell by using the "+ Code" option (upper left of this window, or it will appear if you mouse over the space just below this cell) and a new Markdown cell by using the "+ Text" option.

Step 2: In the first cell you inserted, print out the version of Python that you have installed by typing these two lines of code:

    import sys
    sys.version

Then, run this code cell using the menu bar, run icon, or keyboard shortcut.

Step 3: In the markdown cell, write your name, PID, major, graduation year, and the version of Python that you see being used. Your name and PID should be on the first line and be an h1 header. Your major should be on the second line and be an h2 header. Your graduation year should be on the third line and be an h3 header. The version of Python you have should be on the fourth line and should not be a header. Run this cell.

Step 4: If you inserted any extra cells you can delete them using the trash can icon.

### Now, let's move on to something more interesting, like reading in data!

Before we can read in the data, we have to **import** the required libraries. Run the code below so that these libraries are available further down:
- **pandas** is an open source library with easy-to-use data structures and data analysis tools for Python
- **requests** is a library for sending HTTP requests across the internet easily
- **json_normalize** is a pandas function that takes data in JSON format and flattens it into a table of data
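To make the role of json_normalize more concrete, here is a minimal sketch that runs without a network connection. The records are made up for illustration; they only imitate the general shape of an API response with a nested "system" object and are not real Top500 entries.

```python
import pandas as pd

# Hypothetical records: flat fields plus a nested "system" object.
# The names and values are invented for illustration.
records = [
    {"id": 1, "computer": "Machine A", "system": {"id": 1, "processor": "CPU X"}},
    {"id": 2, "computer": "Machine B", "system": {"id": 2, "processor": None}},
]

# json_normalize flattens nested objects into columns named "parent.child"
flat = pd.json_normalize(records)

print(sorted(flat.columns))
print(flat.shape)
```

The nested fields become the columns system.id and system.processor, which is where dotted column names in a normalized table come from.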

In [None]:
import pandas as pd # allows us to access pandas using 'pd'
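import requests as re # same idea; note that this alias shares its name with Python's built-in re (regular expression) module
from pandas import json_normalize # top-level import (pandas 1.0+); pandas.io.json.json_normalize is deprecated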

Let's retrieve the Top500 Benchmark data from the CSGenome API.

API: https://en.wikipedia.org/wiki/Application_programming_interface

View the Top500 Benchmark data in a web browser: https://csgenome.org/api/benchmarks/top500

In [None]:
# base URL of the Top500 endpoint; no parameters are passed, so the API defaults apply
url = 'https://csgenome.org/api/benchmarks/top500' 

Let's convert the data from JSON format into a DataFrame.

DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html 
* Primary pandas data structure
* Two-dimensional, size-mutable (size can be changed), potentially heterogeneous tabular data
* Basically, a table with rows, columns, and values and a really useful data structure!
* Contain labeled axes (rows and columns) 
* n = cardinality (rows)
* p = dimensionality (columns)
* index = row labels 
* columns have names and data types
* Can read CSV files into DataFrames and export DataFrames to CSV files
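The properties listed above can be tried out on a tiny DataFrame. This is a sketch with made-up values; the column names are chosen only to resemble the benchmark data, and an in-memory text buffer stands in for a CSV file on disk.

```python
import io
import pandas as pd

# Made-up values for illustration only
df = pd.DataFrame(
    {
        "computer": ["Machine A", "Machine B", "Machine C"],
        "rank": [1, 2, 3],
        "r_max": [170.0, 143.4, 127.1],
    }
)

n, p = df.shape          # n = rows (cardinality), p = columns (dimensionality)
print(n, p)
print(list(df.index))    # default row labels start at 0
print(df.dtypes)         # each column has a name and a data type

# Export to CSV text and read it back into a new DataFrame
csv_text = df.to_csv(index=False)
df_again = pd.read_csv(io.StringIO(csv_text))
print(df_again.shape)
```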

In [None]:
# request the URL, parse the JSON response, and flatten the records under 'data' into a DataFrame
top_500 = pd.json_normalize(re.get(url).json()['data'])

In [None]:
# prints the DataFrame
top_500

Unnamed: 0,computer,date,power_measured_size,r_max,top500_rank,n_half,n_max,r_peak,year,information_source,power,bmark_type,id,system.processor,system.id
0,Numerical Wind Tunnel,1995/11|1998/11|1996/06|2001/11|1996/11|1998/0...,,170.0|229.0|170.0|229.0|229.0|229.0|229.0|229....,1.0|23.0|2.0|130.0|2.0|15.0|106.0|7.0|1.0|35.0...,13800.0|18018.0|13800.0|18018.0|18018.0|18018....,42000.0|66132.0|42000.0|66132.0|66132.0|66132....,235.79|281.26|279.58|281.26|281.26|281.26|281....,1993|1996|1996|1996|1996|1996|1996|1996|1993|1...,SUPER500_1995/11|SUPER500_1998/11|SUPER500_199...,|||||||||||||||||,top500,15115,,15115
1,XP/S140,1995/11|1998/11|1996/06|1996/11|1998/06|1997/1...,,143.4|143.4|143.4|143.4|143.4|143.4|143.4|143....,2.0|36.0|3.0|4.0|28.0|18.0|1.0|2.0|12.0|2.0,20500.0|20500.0|20500.0|20500.0|20500.0|20500....,55700.0|55700.0|55700.0|55700.0|55700.0|55700....,184.0|184.0|184.0|184.0|184.0|184.0|184.0|184....,1993|1993|1993|1993|1993|1993|1993|1993|1993|1993,SUPER500_1995/11|SUPER500_1998/11|SUPER500_199...,|||||||||,top500,15116,,15116
2,XP/S-MP 150,1995/11|1998/11|1996/06|1996/11|1998/06|1997/1...,,127.1|127.1|127.1|127.1|127.1|127.1|127.1|127....,3.0|39.0|4.0|5.0|30.0|24.0|52.0|70.0|15.0|3.0,17800.0|17800.0|17800.0|17800.0|17800.0|17800....,86000.0|86000.0|86000.0|86000.0|86000.0|86000....,154.0|154.0|154.0|154.0|154.0|154.0|154.0|154....,1995|1995|1995|1995|1995|1995|1995|1995|1995|1995,SUPER500_1995/11|SUPER500_1998/11|SUPER500_199...,|||||||||,top500,15117,,15117
3,T3D MC1024-8,1995/11|1998/11|1996/06|2001/11|1996/11|1998/0...,,100.5|100.5|100.5|100.5|100.5|100.5|100.5|100....,4.0|57.0|6.0|393.0|7.0|42.0|271.0|32.0|70.0|10...,10224.0|10224.0|10224.0|10224.0|10224.0|10224....,81920.0|81920.0|81920.0|81920.0|81920.0|81920....,153.6|153.6|153.6|153.6|153.6|153.6|153.6|153....,1994|1994|1994|1994|1994|1994|1994|1994|1994|1...,SUPER500_1995/11|SUPER500_1998/11|SUPER500_199...,|||||||||||||,top500,15118,,15118
4,VPP500/80,1995/11|1998/11|1996/06|1996/11|1998/06|1997/1...,,98.9|109.0|98.9|98.9|109.0|109.0|109.0|109.0|1...,5.0|49.0|7.0|8.0|38.0|29.0|64.0|90.0|18.0|5.0,10050.0|11030.0|10050.0|10050.0|11030.0|11030....,32640.0|46400.0|32640.0|32640.0|46400.0|46400....,128.0|128.0|128.0|128.0|128.0|128.0|128.0|128....,1994|1994|1994|1994|1994|1994|1994|1994|1994|1994,SUPER500_1995/11|SUPER500_1998/11|SUPER500_199...,|||||||||,top500,15119,,15119
5,SP2/512,1995/11|1996/06|1996/11|1994/11|1997/06|1995/06,,88.4|88.4|88.4|12.1|88.4|44.2,6.0|8.0|14.0|54.0|26.0|11.0,20150.0|20150.0|20150.0|6200.0|20150.0|13500.0,73500.0|73500.0|73500.0|27000.0|73500.0|53000.0,136.19|136.19|136.19|136.19|136.19|136.19,1994|1994|1994|1994|1994|1994,SUPER500_1995/11|SUPER500_1996/06|SUPER500_199...,|||||,top500,15120,,15120
6,SP2/512,1995/11|1996/06|1996/11,,88.4|88.4|88.4,7.0|9.0|15.0,20150.0|20150.0|20150.0,73500.0|73500.0|73500.0,136.19|136.19|136.19,1995|1995|1995,SUPER500_1995/11|SUPER500_1996/06|SUPER500_199...,||,top500,15121,,15121
7,SP2/384,1995/11|1996/06|1996/11|1997/11|1997/06,,66.3|66.3|66.3|66.3|66.3,8.0|12.0|16.0|45.0|35.0,0.0|0.0|0.0|0.0|0.0,0.0|0.0|0.0|0.0|0.0,102.14|102.14|102.14|102.14|102.14,1994|1994|1994|1994|1994,SUPER500_1995/11|SUPER500_1996/06|SUPER500_199...,||||,top500,15122,,15122
8,SX-4/32,1995/11|1998/11|1996/06|1996/11|1998/06|1997/1...,,60.72|61.7|66.53|60.6|61.7|61.7|61.7|61.7|60.6...,9.0|77.0|11.0|18.0|58.0|48.0|103.0|231.0|39.0|...,0.0|1688.0|1792.0|1560.0|1688.0|1688.0|1688.0|...,0.0|20480.0|15360.0|10000.0|20480.0|20480.0|20...,64.0|64.0|64.0|64.0|64.0|64.0|64.0|64.0|64.0|64.0,1995|1995|1995|1995|1995|1995|1995|1995|1995|1995,SUPER500_1995/11|SUPER500_1998/11|SUPER500_199...,|||||||||,top500,15123,,15123
9,CM-5/1056,1995/11|1998/11|1996/06|1996/11|1998/06|1994/0...,,59.7|59.7|59.7|59.7|59.7|59.7|59.7|59.7|59.7,10.0|80.0|13.0|21.0|61.0|3.0|3.0|43.0|6.0,24064.0|24064.0|24064.0|24064.0|24064.0|24064....,52224.0|52224.0|52224.0|52224.0|52224.0|52224....,135.17|135.17|135.17|135.17|135.17|135.17|135....,1993|1993|1993|1993|1993|1993|1993|1993|1993,SUPER500_1995/11|SUPER500_1998/11|SUPER500_199...,||||||||,top500,15124,,15124


The index values start at 0 and end at 49 (only the first rows are shown above), so there are 50 rows in the dataframe (cardinality = 50). The first column is the computer column and the last column is the system.id column. If we counted the columns, we would find that there are 15 (dimensionality = 15). Let's use DataFrame.shape to print the size of the dataframe.

In [None]:
top_500.shape

(50, 15)

Let's check out the data types of the columns using DataFrame.dtypes.

In [None]:
top_500.dtypes

computer               object
date                   object
power_measured_size    object
r_max                  object
top500_rank            object
n_half                 object
n_max                  object
r_peak                 object
year                   object
information_source     object
power                  object
bmark_type             object
id                      int64
system.processor       object
system.id               int64
dtype: object

Most of the columns have the object dtype. pandas uses object for columns that hold strings (text) or a mix of types, so columns that look numeric, such as r_max, are not stored as numbers here. In the output above, each r_max entry is a single string in which several values are joined by "|".
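As a sketch of what that means in practice, the example below builds a small column of pipe-delimited strings and converts it to numbers. The rows are made up in the style of the output above, and r_max_best is a hypothetical column name. Empty pieces, like those in the power column, would need extra handling that is not shown here.

```python
import pandas as pd

# Made-up rows: several readings joined by "|" in a single string
df = pd.DataFrame({"r_max": ["170.0|229.0|170.0", "143.4|143.4", "66.3"]})
print(df["r_max"].dtype)   # a text dtype (object in most pandas versions), not float

# Split each string on "|" and convert the pieces to floats
readings = df["r_max"].str.split("|").apply(lambda parts: [float(x) for x in parts])

# Example summary: keep the largest reading in each row
df["r_max_best"] = readings.apply(max)
print(df["r_max_best"].dtype)
print(df["r_max_best"].tolist())
```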

The top500 API also has three parameters that we can pass arguments to:
* page: type - int; description - page of collection
* limit: type - int; description - number of items to return (default: 50)
* columns: type - comma-separated list; description - subset of columns to return

Notice that the default limit is 50. Since we did not pass in an argument for the limit parameter, our DataFrame has 50 rows. Let's try passing in arguments to the parameters. We will pass in 25000 as the limit, 1 as the page, and r_max, r_peak, and n_max for the columns.
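Query strings like this can also be assembled from a dictionary rather than typed by hand. The following is a sketch using Python's standard library; it only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

base_url = "https://csgenome.org/api/benchmarks/top500"

# The three query parameters described above
params = {"page": 1, "limit": 25000, "columns": "r_max,r_peak,n_max"}

# safe="," keeps the commas in the column list instead of encoding them as %2C
built_url = base_url + "?" + urlencode(params, safe=",")
print(built_url)
```

The requests library can do the same job at request time: its get function accepts a params dictionary and appends it to the URL.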

In [None]:
# using a \ after the first line to break up the long line
url_param = 'https://csgenome.org/api/benchmarks/top500?\
page=1&limit=25000&columns=r_max,r_peak,n_max'

In [None]:
top_500_2 = pd.json_normalize(re.get(url_param).json()['data'])

In [None]:
top_500_2

Unnamed: 0,r_max,r_peak,n_max
0,170.0|229.0|170.0|229.0|229.0|229.0|229.0|229....,235.79|281.26|279.58|281.26|281.26|281.26|281....,42000.0|66132.0|42000.0|66132.0|66132.0|66132....
1,143.4|143.4|143.4|143.4|143.4|143.4|143.4|143....,184.0|184.0|184.0|184.0|184.0|184.0|184.0|184....,55700.0|55700.0|55700.0|55700.0|55700.0|55700....
2,127.1|127.1|127.1|127.1|127.1|127.1|127.1|127....,154.0|154.0|154.0|154.0|154.0|154.0|154.0|154....,86000.0|86000.0|86000.0|86000.0|86000.0|86000....
3,100.5|100.5|100.5|100.5|100.5|100.5|100.5|100....,153.6|153.6|153.6|153.6|153.6|153.6|153.6|153....,81920.0|81920.0|81920.0|81920.0|81920.0|81920....
4,98.9|109.0|98.9|98.9|109.0|109.0|109.0|109.0|1...,128.0|128.0|128.0|128.0|128.0|128.0|128.0|128....,32640.0|46400.0|32640.0|32640.0|46400.0|46400....
...,...,...,...
8030,1.955,2.4,1900.0
8031,1.955,2.4,1900.0
8032,1.955,2.4,1900.0
8033,1.955,2.4,1900.0


Now we can see that this dataframe contains 8035 rows and the three columns that we entered for the columns parameter. Since there are so many rows, let's print only the first few rows using DataFrame.head().

DataFrame.head(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html

- method that returns the first n rows of a dataframe; the default value for n is 5

In [None]:
top_500_2.head()

Unnamed: 0,r_max,r_peak,n_max
0,170.0|229.0|170.0|229.0|229.0|229.0|229.0|229....,235.79|281.26|279.58|281.26|281.26|281.26|281....,42000.0|66132.0|42000.0|66132.0|66132.0|66132....
1,143.4|143.4|143.4|143.4|143.4|143.4|143.4|143....,184.0|184.0|184.0|184.0|184.0|184.0|184.0|184....,55700.0|55700.0|55700.0|55700.0|55700.0|55700....
2,127.1|127.1|127.1|127.1|127.1|127.1|127.1|127....,154.0|154.0|154.0|154.0|154.0|154.0|154.0|154....,86000.0|86000.0|86000.0|86000.0|86000.0|86000....
3,100.5|100.5|100.5|100.5|100.5|100.5|100.5|100....,153.6|153.6|153.6|153.6|153.6|153.6|153.6|153....,81920.0|81920.0|81920.0|81920.0|81920.0|81920....
4,98.9|109.0|98.9|98.9|109.0|109.0|109.0|109.0|1...,128.0|128.0|128.0|128.0|128.0|128.0|128.0|128....,32640.0|46400.0|32640.0|32640.0|46400.0|46400....
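head() also accepts the number of rows to return, and tail() does the same from the end of the dataframe. A small sketch with placeholder data (the numbers 0 through 9, not benchmark values):

```python
import pandas as pd

df = pd.DataFrame({"value": range(10)})   # placeholder data: 0..9

first_three = df.head(3)   # rows 0, 1, 2
last_two = df.tail(2)      # rows 8, 9

print(first_three["value"].tolist())
print(last_two["value"].tolist())
print(len(df))             # head() and tail() return new objects; df keeps all rows
```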


Note that displaying the head of the dataframe did not change top_500_2; it still contains all of its rows. To change a dataframe, we have to assign the result back to the same variable, which overwrites the original. A better practice is to assign top_500_2.head() to a new variable so that the original dataframe is preserved. For example:

In [None]:
# creates new dataframe, top_500_head
top_500_head = top_500_2.head()
top_500_head

Unnamed: 0,r_max,r_peak,n_max
0,170.0|229.0|170.0|229.0|229.0|229.0|229.0|229....,235.79|281.26|279.58|281.26|281.26|281.26|281....,42000.0|66132.0|42000.0|66132.0|66132.0|66132....
1,143.4|143.4|143.4|143.4|143.4|143.4|143.4|143....,184.0|184.0|184.0|184.0|184.0|184.0|184.0|184....,55700.0|55700.0|55700.0|55700.0|55700.0|55700....
2,127.1|127.1|127.1|127.1|127.1|127.1|127.1|127....,154.0|154.0|154.0|154.0|154.0|154.0|154.0|154....,86000.0|86000.0|86000.0|86000.0|86000.0|86000....
3,100.5|100.5|100.5|100.5|100.5|100.5|100.5|100....,153.6|153.6|153.6|153.6|153.6|153.6|153.6|153....,81920.0|81920.0|81920.0|81920.0|81920.0|81920....
4,98.9|109.0|98.9|98.9|109.0|109.0|109.0|109.0|1...,128.0|128.0|128.0|128.0|128.0|128.0|128.0|128....,32640.0|46400.0|32640.0|32640.0|46400.0|46400....


And we still have the original dataframe.

In [None]:
top_500_2

Unnamed: 0,r_max,r_peak,n_max
0,170.0|229.0|170.0|229.0|229.0|229.0|229.0|229....,235.79|281.26|279.58|281.26|281.26|281.26|281....,42000.0|66132.0|42000.0|66132.0|66132.0|66132....
1,143.4|143.4|143.4|143.4|143.4|143.4|143.4|143....,184.0|184.0|184.0|184.0|184.0|184.0|184.0|184....,55700.0|55700.0|55700.0|55700.0|55700.0|55700....
2,127.1|127.1|127.1|127.1|127.1|127.1|127.1|127....,154.0|154.0|154.0|154.0|154.0|154.0|154.0|154....,86000.0|86000.0|86000.0|86000.0|86000.0|86000....
3,100.5|100.5|100.5|100.5|100.5|100.5|100.5|100....,153.6|153.6|153.6|153.6|153.6|153.6|153.6|153....,81920.0|81920.0|81920.0|81920.0|81920.0|81920....
4,98.9|109.0|98.9|98.9|109.0|109.0|109.0|109.0|1...,128.0|128.0|128.0|128.0|128.0|128.0|128.0|128....,32640.0|46400.0|32640.0|32640.0|46400.0|46400....
...,...,...,...
8030,1.955,2.4,1900.0
8031,1.955,2.4,1900.0
8032,1.955,2.4,1900.0
8033,1.955,2.4,1900.0


Another option is to first make a copy of the top_500_2 dataframe using DataFrame.copy() and then overwrite top_500_2.

In [None]:
top_500_og = top_500_2.copy()
top_500_2 = top_500_2.head()
top_500_2

Unnamed: 0,r_max,r_peak,n_max
0,170.0|229.0|170.0|229.0|229.0|229.0|229.0|229....,235.79|281.26|279.58|281.26|281.26|281.26|281....,42000.0|66132.0|42000.0|66132.0|66132.0|66132....
1,143.4|143.4|143.4|143.4|143.4|143.4|143.4|143....,184.0|184.0|184.0|184.0|184.0|184.0|184.0|184....,55700.0|55700.0|55700.0|55700.0|55700.0|55700....
2,127.1|127.1|127.1|127.1|127.1|127.1|127.1|127....,154.0|154.0|154.0|154.0|154.0|154.0|154.0|154....,86000.0|86000.0|86000.0|86000.0|86000.0|86000....
3,100.5|100.5|100.5|100.5|100.5|100.5|100.5|100....,153.6|153.6|153.6|153.6|153.6|153.6|153.6|153....,81920.0|81920.0|81920.0|81920.0|81920.0|81920....
4,98.9|109.0|98.9|98.9|109.0|109.0|109.0|109.0|1...,128.0|128.0|128.0|128.0|128.0|128.0|128.0|128....,32640.0|46400.0|32640.0|32640.0|46400.0|46400....


And the original dataframe is now top_500_og. 

In [None]:
top_500_og

Unnamed: 0,r_max,r_peak,n_max
0,170.0|229.0|170.0|229.0|229.0|229.0|229.0|229....,235.79|281.26|279.58|281.26|281.26|281.26|281....,42000.0|66132.0|42000.0|66132.0|66132.0|66132....
1,143.4|143.4|143.4|143.4|143.4|143.4|143.4|143....,184.0|184.0|184.0|184.0|184.0|184.0|184.0|184....,55700.0|55700.0|55700.0|55700.0|55700.0|55700....
2,127.1|127.1|127.1|127.1|127.1|127.1|127.1|127....,154.0|154.0|154.0|154.0|154.0|154.0|154.0|154....,86000.0|86000.0|86000.0|86000.0|86000.0|86000....
3,100.5|100.5|100.5|100.5|100.5|100.5|100.5|100....,153.6|153.6|153.6|153.6|153.6|153.6|153.6|153....,81920.0|81920.0|81920.0|81920.0|81920.0|81920....
4,98.9|109.0|98.9|98.9|109.0|109.0|109.0|109.0|1...,128.0|128.0|128.0|128.0|128.0|128.0|128.0|128....,32640.0|46400.0|32640.0|32640.0|46400.0|46400....
...,...,...,...
8030,1.955,2.4,1900.0
8031,1.955,2.4,1900.0
8032,1.955,2.4,1900.0
8033,1.955,2.4,1900.0
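To see why copy() is worth the extra call, the sketch below contrasts a plain assignment, which only creates a second name for the same dataframe, with a real copy. The variable names alias and backup are hypothetical and the values are made up.

```python
import pandas as pd

original = pd.DataFrame({"r_max": [170.0, 143.4, 127.1]})

alias = original           # a second name for the SAME dataframe object
backup = original.copy()   # an independent copy of the data

# Change a value in place through the original name
original.loc[0, "r_max"] = -1.0

print(alias.loc[0, "r_max"])    # the alias sees the change
print(backup.loc[0, "r_max"])   # the copy keeps the old value
```

The difference only shows up when a dataframe is modified in place, so making a copy first is a cheap safeguard before experimenting.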


### Practice reading in data and using dataframes

Step 1: Read in the data from the CSGenome API and use different arguments for the parameters. Convert the data from JSON format into a dataframe and call the dataframe top_500_df.

Step 2: Print the size and data types of the top_500_df dataframe.

Step 3: Create a new dataframe that is equal to the first 10 rows of top_500_df. Keep the original top_500_df dataframe unchanged.