<a href="https://colab.research.google.com/github/marshall-kirk/CSC310/blob/main/03_accessing_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
###### Set Up #####
# verify our folder with the data and module assets is installed
# if it is installed make sure it is the latest
!test -e ds-assets && cd ds-assets && git pull && cd ..
# if it is not installed clone it
!test ! -e ds-assets && git clone https://github.com/IndraniMandal/ds-assets.git
# point to the folder with the assets
home = "ds-assets/assets/"
import sys
sys.path.append(home)      # add home folder to module search path

Already up to date.


# Reading and Publishing

Colab notebooks are web-based resources.  Consequently, all data access is web-based via URLs and [REST APIs](https://restfulapi.net/). Here we will show you how to read from various data sources around the web and one way to publish your own data as part of a notebook. Finally, we will show you how to submit your assignments to BrightSpace.  Reading data from SQL databases is discussed separately in a later section.


## Reading Data from the CSC310 'Assets' Repository

The CSC310 `assets` repository contains many of the data sets and resources needed for this course. The setup code above clones that repository (named `ds-assets` on GitHub) into our local Colab VM environment, so the resources end up in the folder `ds-assets/assets`.  You need to include this setup code in your own notebooks so that they have access to the CSC310 resources. Once the setup code is executed, we can list what's in that folder by issuing the command,

In [None]:
!ls ds-assets/assets

2fold-xval.png		   mlp_regression2.py
5fold-xval.png		   mlp_regression.py
abalone.csv		   model-performance-curves.png
bootstrap.py		   newsgroups.csv
caesarian.csv		   newsgroups-noheaders.csv
cars.csv		   PandasPythonForDataScience.jpg
classification1.jpg	   PandasPythonForDataScience.pdf
classification2.jpg	   pdf-badge.png
classification3.jpg	   perceptron-eq.jpg
colab-badge.afdesign	   perceptron.jpg
colab-icon.afdesign	   perceptron.r
colab-icon.png		   perceptron-search.png
confint.py		   perceptron-train.jpg
confusion1.png		   pipeline.png
confusion2.png		   __pycache__
crohnd.csv		   regression1.jpg
cross-validated-curve.png  rs.png
data-science.jpg	   shuttle.csv
divorce.csv		   shuttle.pdf
divorce-readme.txt	   sobar-72.csv
elbow.py		   swans.jpg
github-icon.png		   tennis.csv
google_drive.py		   tennis_numeric.csv
grid-stability.csv	   training-curves.jpg
helloagain.py		   train-test-curves.png
helloworld.py		   train-test-data.png
iris.csv		   tree-model.png
kmeans-steps.

The variable `home` defined by the setup code points to that folder and we can use it to read files from that folder.  For example, above you can see that there is a file called `tennis.csv`.  Let's read that file into a Pandas dataframe,

In [None]:
import pandas
df = pandas.read_csv(home+"tennis.csv")
df.head()

Unnamed: 0,outlook,temp,humidity,windy,play
0,sunny,hot,high,False,no
1,sunny,hot,high,True,no
2,overcast,hot,high,False,yes
3,rainy,mild,high,False,yes
4,rainy,cool,normal,False,yes
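
If you would rather not glue strings together with `+`, here is a minimal sketch of a small path helper.  The function name `asset_path` is our own invention for illustration and is not part of the course assets; the value of `home` is copied from the setup code above.

```python
import os

# same value as in the setup code above
home = "ds-assets/assets/"

def asset_path(filename, folder=home):
    """Build the path to a file in the assets folder (hypothetical helper)."""
    return os.path.join(folder, filename)

path = asset_path("tennis.csv")
# check that the file is really there before trying to read it
if os.path.exists(path):
    print("found", path)
else:
    print("missing", path, "- did the setup code run?")
```

The resulting path can be passed to `pandas.read_csv` just like `home+"tennis.csv"`.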


## Reading Files from your Google Drive

Google Drive is not a hierarchical file system.  It simply stores files via unique identifiers.  You can see these unique identifiers when you look at
the "share link" of a file.  For example, I have a file called `iris-local.csv` on my Google Drive.  The "share link" is
```
https://drive.google.com/file/d/1U9bYx5tQd4ZYLQvTSt11EHTammg2zwoL/view?usp=sharing
```
The long string of seemingly random numbers and letters is the unique file id. We can use the share link to read the file into our notebook.  To do that we use a module called `google_drive` available in our `assets` folder.  The `assets` folder has been added to the Python path by the setup code, so we can just import the `google_drive` module.  In particular, we import the function `downloadlink` that converts a share link into a link that points at Google Drive's REST API for downloading files.  The following code converts the share link for my `iris-local.csv` file into a downloadable URL and then reads the file from my Google Drive,

In [None]:
import pandas
from google_drive import downloadlink
sharelink = "https://drive.google.com/file/d/1U9bYx5tQd4ZYLQvTSt11EHTammg2zwoL/view?usp=sharing"
url = downloadlink(sharelink) # convert the share link into a url pointing at the REST API
df = pandas.read_csv(url)
df.head()

Unnamed: 0,id,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,1,5.1,3.5,1.4,0.2,setosa
1,2,4.9,3.0,1.4,0.2,setosa
2,3,4.7,3.2,1.3,0.2,setosa
3,4,4.6,3.1,1.5,0.2,setosa
4,5,5.0,3.6,1.4,0.2,setosa
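
To give you an idea of what such a conversion might look like, here is a rough sketch.  We have not reproduced the actual source of `downloadlink` here, so treat this as an assumption: the helper name `drive_download_url` is hypothetical, and the `uc?export=download` address is a commonly used Google Drive download URL that may differ from what the course module builds.

```python
import re

def drive_download_url(sharelink):
    """Turn a Google Drive share link into a download URL (hypothetical helper)."""
    # in a share link the file id sits right after "/d/"
    match = re.search(r"/d/([A-Za-z0-9_-]+)", sharelink)
    if match is None:
        raise ValueError("no file id found in share link: " + sharelink)
    file_id = match.group(1)
    # assumed download address; the course's downloadlink may use another one
    return "https://drive.google.com/uc?export=download&id=" + file_id

sharelink = "https://drive.google.com/file/d/1U9bYx5tQd4ZYLQvTSt11EHTammg2zwoL/view?usp=sharing"
print(drive_download_url(sharelink))
```

In your own notebooks you should keep using `downloadlink` from the `google_drive` module; the sketch only illustrates that the file id is the one piece of the share link that matters.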


## Reading Data right from the Web

Some websites make data files directly available for download.  One such website is [Vincent Arel-Bundock's data set collection](https://vincentarelbundock.github.io/Rdatasets/).  If we follow the link to the `html` index we find a collection of datasets in `csv` format.  One such dataset is the credit card dataset (\#21) with the corresponding link to the `csv` file,
```
https://vincentarelbundock.github.io/Rdatasets/csv/AER/CreditCard.csv
```
We can read this data right from that website into our notebook.  There is no need to make a local copy of the data,

In [None]:
import pandas
url = "https://vincentarelbundock.github.io/Rdatasets/csv/AER/CreditCard.csv"
df = pandas.read_csv(url)
df.head()

Unnamed: 0.1,Unnamed: 0,card,reports,age,income,share,expenditure,owner,selfemp,dependents,months,majorcards,active
0,1,yes,0,37.66667,4.52,0.03327,124.9833,yes,no,3,54,1,12
1,2,yes,0,33.25,2.42,0.005217,9.854167,no,no,3,34,1,13
2,3,yes,0,33.66667,4.5,0.004156,15.0,yes,no,4,58,1,5
3,4,yes,0,30.5,2.54,0.065214,137.8692,no,no,0,25,1,7
4,5,yes,0,32.16667,9.7867,0.067051,546.5033,yes,no,2,64,1,5
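
The output above contains a column labelled `Unnamed: 0`.  Pandas assigns names of this form to columns whose header field is empty, which suggests that the first header field of the file was empty when this notebook was run.  The sketch below shows one way to deal with such a column, namely by telling `read_csv` to use it as the row index.  The tiny CSV text is made up for illustration (it only mimics the first two rows and three of the columns shown above) so that the example runs without a network connection.

```python
import io
import pandas

# made-up miniature of a CSV file whose first header field is empty
csv_text = '"",card,reports,age\n1,yes,0,37.66667\n2,yes,0,33.25\n'

# default behavior: the column without a header is named "Unnamed: 0"
df_default = pandas.read_csv(io.StringIO(csv_text))
print(list(df_default.columns))

# index_col=0 uses the first column as the row index instead
df_indexed = pandas.read_csv(io.StringIO(csv_text), index_col=0)
print(list(df_indexed.columns))
```

The same `index_col=0` argument can be passed when reading directly from a URL.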


## Reading Data from GitHub Repositories

GitHub maintains the `raw.githubusercontent.com` domain that allows users to access files in repositories unprocessed.
For example, consider the file `tennis.csv` in my GitHub repository `ds-assets` with the following parameters,

* Account: `IndraniMandal`
* Repository: `ds-assets`
* Branch: `main`
* Folder: `assets`
* File: `tennis.csv`

Perhaps the only surprising thing here is the 'Branch' parameter.  For most file accesses we are interested in the default branch of the repository, which is usually called either `master` or `main`.  In my repository it is called `main`.  Given this information we can construct a raw access URL to the `tennis.csv` file using the scheme,
```
https://raw.githubusercontent.com/<account>/<repository>/<branch>/<folder>/<filename>
```
This gives us the URL,
```
https://raw.githubusercontent.com/IndraniMandal/ds-assets/main/assets/tennis.csv
```
Let's try this with some code,

In [None]:
import pandas
url = "https://raw.githubusercontent.com/IndraniMandal/ds-assets/main/assets/tennis.csv"
df = pandas.read_csv(url)
df.head()

Unnamed: 0,outlook,temp,humidity,windy,play
0,sunny,hot,high,False,no
1,sunny,hot,high,True,no
2,overcast,hot,high,False,yes
3,rainy,mild,high,False,yes
4,rainy,cool,normal,False,yes
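
The URL scheme above can also be captured in a small function.  This is just a sketch; the name `raw_github_url` is our own and not part of any library.

```python
def raw_github_url(account, repository, branch, folder, filename):
    """Assemble a raw.githubusercontent.com URL from its parts (hypothetical helper)."""
    parts = [account, repository, branch, folder, filename]
    return "https://raw.githubusercontent.com/" + "/".join(parts)

url = raw_github_url(
    account="IndraniMandal",
    repository="ds-assets",
    branch="main",
    folder="assets",
    filename="tennis.csv",
)
print(url)
```

The value of `url` can then be handed to `pandas.read_csv` exactly as in the cell above.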


## Publishing your Dataset/Notebook

Your Google Drive acts like a webserver for (data) files.  If you make a file shareable then anybody with the link can access that file from the web.  This is especially important when you want to collaborate with team members on a notebook.  In that case, one of the team members should be declared the host of the datafile and the notebook.  Both the datafile and notebook will need to be made shareable by the host and the link of the notebook shared with the team members.

Let's say I want this notebook to be shared with my team members and the datafile we want to use is my `iris-local.csv` file.  I would share the link of this notebook with my team members and in the notebook I would point to the data file we want to analyze.  Something like this,

In [None]:
import pandas
from google_drive import downloadlink
# sharelink points to my iris file
sharelink = "https://drive.google.com/file/d/1U9bYx5tQd4ZYLQvTSt11EHTammg2zwoL/view?usp=sharing"
url = downloadlink(sharelink) # convert the share link into a url pointing at the REST API
df = pandas.read_csv(url)
(rows,cols) = df.shape
print("The data set has {} rows and {} columns".format(rows,cols))

The data set has 150 rows and 6 columns


Since both my notebook and my dataset are shareable, my team members will be able to execute the notebook and see the same analysis that I see.  They will also be able to edit my notebook, thereby contributing to the analysis of this dataset.  Use the `Share` button in the top-right corner of your notebook to set the share permissions.  I would recommend setting the permissions to `anybody with this link` and allowing full editing capabilities for collaboration.

## Submitting your Work to BrightSpace

Submitting your work to BrightSpace follows pretty much the same pattern as publishing your dataset/notebook in the previous section.  You have to get a share link using the `Share` button in the upper-right corner of the notebook that you want to submit.  I would again suggest setting the permissions to `anybody with this link` and allowing editing.  Once you have the share link, simply paste it into the appropriate field in the BrightSpace assignment.

If you are submitting a notebook that was created as a team, then only one team member needs to submit a link, preferably the host. Make sure that the names of all team members are given in the notebook.