<a href="https://colab.research.google.com/github/mentaltraveller/Data-Science/blob/main/Data_retrieval_worksheet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data retrieval
---

Examples of retrieving data from a range of sources.


## From a web page
---

The code below reads all the data tables from the Wikipedia page on Glasgow into a list called datatables.  The 8th table on the page shows population data over a period of centuries.

pd.read_html returns a list of dataframes, one for each table found on the page.  The cell below uses len(datatables) to find out how many tables are in the list.  To access the 8th table in the list, use the index [7], i.e. datatables[7].  Change the index to see other data tables.

In [None]:
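import pandas as pd

# read_html returns a list of dataframes, one per table found on the page
datatables = pd.read_html('https://en.wikipedia.org/wiki/Glasgow#Climate')
# count the tables in the list; datatables[7] would show the 8th table
table_count = len(datatables)
table_count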

25
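
Because `pd.read_html` returns an ordinary Python list of dataframes, counting and indexing work as they do for any list.  The sketch below is a minimal illustration that assumes a small stand-in list with placeholder values instead of the tables fetched from the page.  The helper name `get_table` is hypothetical and is not part of pandas.

```python
import pandas as pd

# Stand-in for the list that pd.read_html(...) returns.
# The contents are placeholders, not data from the Glasgow page.
datatables = [
    pd.DataFrame({"value": [1, 2]}),
    pd.DataFrame({"value": [3, 4]}),
    pd.DataFrame({"value": [5, 6]}),
]

def get_table(tables, position):
    """Return the table at a 1-based position (hypothetical helper)."""
    if not 1 <= position <= len(tables):
        raise IndexError(f"position must be between 1 and {len(tables)}")
    return tables[position - 1]  # list indexes start at 0

table_count = len(datatables)      # how many tables are in the list
second = get_table(datatables, 2)  # same as datatables[1]
print(table_count)
print(second)
```

With the real list from the page, `get_table(datatables, 8)` would return the same dataframe as `datatables[7]`.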

## From a csv file hosted on Github.com
---

The code below reads a data table stored in a Comma Separated Values (CSV) file.  This is a text file containing rows of data, with the columns in each row separated by commas.

If you were using Jupyter Notebooks on your device, the url could be replaced with the path to the CSV file.

In [None]:
import pandas as pd
url = "https://raw.githubusercontent.com/futureCodersSE/working-with-data/main/Data%20sets/Paisley-Weather-Data.csv"
df = pd.read_csv(url)
df

Unnamed: 0,yyyy,mm,tmax (degC),tmin (degC),af (days),rain (mm),sun (hours),status
0,1959,1,4,-2,25,40.9,54.1,
1,1959,2,6.6,2.1,10,41.8,17.8,
2,1959,3,10.6,4.2,0,50.9,85.7,
3,1959,4,13,5.2,0,76.3,125.1,
4,1959,5,18.1,7.9,0,24,222,
...,...,...,...,...,...,...,...,...
741,2020,10,12.9*,7.1*,0*,185.3*,76.8*,Provisional
742,2020,11,10.6*,6.0*,0*,142.4*,29.3*,Provisional
743,2020,12,6.9*,2.6*,8*,131.0*,31.6*,Provisional
744,2021,1,4.9*,-0.2*,14*,132.2*,51.0*,Provisional
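
To see what `pd.read_csv` does with comma-separated text, the sketch below reads three rows, copied from the output above, from an in-memory string instead of a url.  It assumes the file uses the same column layout as the displayed table.  `io.StringIO` is used only so that the example needs no download, and the names `csv_text` and `sample` are chosen for illustration.

```python
import io
import pandas as pd

# A few rows in the layout of the weather data; the empty field at the
# end of each row is the 'status' column, which has no value here.
csv_text = """yyyy,mm,tmax (degC),tmin (degC),af (days),rain (mm),sun (hours),status
1959,1,4,-2,25,40.9,54.1,
1959,2,6.6,2.1,10,41.8,17.8,
1959,3,10.6,4.2,0,50.9,85.7,
"""

# read_csv accepts a url, a local file path or a file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)
print(sample)
```

A path to a CSV file on your own device would be passed to `pd.read_csv` in the same way as the url.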


## From an Excel file hosted on Github.com
---

The code below reads the data table from a sheet in an Excel file.  If you don't specify a sheet then it will assume that you want to read the data from the first sheet in the Excel workbook (sheet_name = 0).  If you don't know the sheet name but know it is the second sheet, you can use sheet_name = 1, or 2 for the third sheet, etc.

If you were using Jupyter Notebooks on your device, the url could be replaced with the path to the Excel file.

In [None]:
import pandas as pd
url = "https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true"
df = pd.read_excel(url,sheet_name="Industry Migration")
df

## From an API which delivers the data in JSON format
---

The code below requests the data from the url.  This is a little trickier than the other ways of getting data, because how you access the data depends on how it is organised.

In this example, the response is a dictionary with the key 'data', against which the actual data is stored.  The code takes the value stored under the 'data' key and assigns it to json_data.

Here, json_data is a list of JSON objects, but the list contains only one object.  Try adding the line `print(json_data)` to see this.

data_table is the first object in the json_data list.  Try adding the line `print(data_table)` to see this.

In this example, the data_table object has three keys: 'to', 'from' and 'regions'.  The 'regions' value is the data we want to use in our dataframe, so we normalize this JSON data into a pandas dataframe (df), which you can see as the output.

Each API is likely to deliver its data in a different format, so you will need to read the documentation and inspect the data to see which keys and indexes you need to access.

For information on the format of the data, see https://carbon-intensity.github.io/api-definitions/#regional

In [None]:
import pandas as pd
import requests

url = "https://api.carbonintensity.org.uk/regional"
json_data = requests.get(url).json()['data']
data_table = json_data[0]
df = pd.json_normalize(data_table['regions'])
df
#print(json_data)


Unnamed: 0,regionid,dnoregion,shortname,generationmix,intensity.forecast,intensity.index
0,1,Scottish Hydro Electric Power Distribution,North Scotland,"[{'fuel': 'biomass', 'perc': 0}, {'fuel': 'coa...",162,moderate
1,2,SP Distribution,South Scotland,"[{'fuel': 'biomass', 'perc': 2.4}, {'fuel': 'c...",105,low
2,3,Electricity North West,North West England,"[{'fuel': 'biomass', 'perc': 7.7}, {'fuel': 'c...",130,low
3,4,NPG North East,North East England,"[{'fuel': 'biomass', 'perc': 37.6}, {'fuel': '...",128,low
4,5,NPG Yorkshire,Yorkshire,"[{'fuel': 'biomass', 'perc': 36.4}, {'fuel': '...",225,moderate
5,6,SP Manweb,North Wales & Merseyside,"[{'fuel': 'biomass', 'perc': 2.6}, {'fuel': 'c...",218,moderate
6,7,WPD South Wales,South Wales,"[{'fuel': 'biomass', 'perc': 0}, {'fuel': 'coa...",346,high
7,8,WPD West Midlands,West Midlands,"[{'fuel': 'biomass', 'perc': 5.2}, {'fuel': 'c...",248,moderate
8,9,WPD East Midlands,East Midlands,"[{'fuel': 'biomass', 'perc': 8.9}, {'fuel': 'c...",277,high
9,10,UKPN East,East England,"[{'fuel': 'biomass', 'perc': 0}, {'fuel': 'coa...",109,low
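
The steps described above can be tried without calling the API.  The sketch below uses a hand-written dictionary in the same shape as the response, which is an assumption based on the output shown: the timestamps and the 'wind' percentages are illustrative placeholders, not real readings.  It also shows one possible way to flatten the nested `generationmix` lists with the `record_path` and `meta` arguments of `pd.json_normalize`.

```python
import pandas as pd

# Hand-written sample shaped like the API response.
# Timestamps and 'wind' percentages are illustrative placeholders.
response = {
    "data": [
        {
            "from": "2021-01-01T12:00Z",
            "to": "2021-01-01T12:30Z",
            "regions": [
                {
                    "regionid": 1,
                    "dnoregion": "Scottish Hydro Electric Power Distribution",
                    "shortname": "North Scotland",
                    "intensity": {"forecast": 162, "index": "moderate"},
                    "generationmix": [
                        {"fuel": "biomass", "perc": 0},
                        {"fuel": "wind", "perc": 50.0},
                    ],
                },
                {
                    "regionid": 2,
                    "dnoregion": "SP Distribution",
                    "shortname": "South Scotland",
                    "intensity": {"forecast": 105, "index": "low"},
                    "generationmix": [
                        {"fuel": "biomass", "perc": 2.4},
                        {"fuel": "wind", "perc": 40.0},
                    ],
                },
            ],
        }
    ]
}

json_data = response["data"]     # the list stored under the 'data' key
data_table = json_data[0]        # the only object in that list
print(list(data_table.keys()))   # inspect which keys are available

# One row per region; nested dictionaries become dotted column names
regions = pd.json_normalize(data_table["regions"])

# One row per fuel per region, keeping the region id and name alongside
mix = pd.json_normalize(
    data_table["regions"],
    record_path="generationmix",
    meta=["regionid", "shortname"],
)
print(regions)
print(mix)
```

Printing the keys at each level, as done here for `data_table`, is a quick way to work out which keys and indexes a new API requires.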


## Exercise - upload a CSV file to your own Github account and write Python code to load it into a dataframe
---

1.  Download the CSV file at this link to your downloads folder on your computer: https://drive.google.com/file/d/15vDkpCKqlRHQt8f8VHER97fIqZytIAtu/view?usp=sharing 

2.  Create a folder called 'Data Sets' and move the CSV file into the Data Sets folder.

3.  Log in to your Github account and navigate to the repository where you are uploading all your Colab Worksheets.

4.  Click on the Add File button  
![Add file to Github](https://drive.google.com/uc?id=1szQpVcLg56yPPJc6z4wvK9mCGzSNSa5q)  Select *Upload Files* and then drag the Data Sets folder onto the page (drag the folder rather than the files in it).  Once the folder has uploaded, you will need to commit the changes.  Scroll down the page to see the Commit Changes button.  Before you commit, you can add an extended description *e.g. New folder to store data sets.*

5.  Click to open the Data Sets folder in your Github repository.  Then click to select the file `housing_in_london_yearly_variables.csv`.  You will need a link to the 'raw' data version of this file.  
![Get raw data](https://drive.google.com/uc?id=1_B9_1YK35eRpXp5kN2zBZRu0m_CIBn5i)  
Click on 'raw' and you will see just the data shown in the page.  Select the URL for this page and copy it.  **This is the link you will need.**  

You can now refer to the section above 'From a csv file hosted on Github.com' and create a dataframe from your newly uploaded CSV file.
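
The raw links used in this worksheet follow a common pattern: `raw.githubusercontent.com`, then the account name, repository, branch and file path.  The sketch below assembles a link from those parts, assuming that pattern holds for your repository.  The helper name `github_raw_url` is hypothetical, and copying the URL from the 'raw' page as described above remains the dependable way to get the link.

```python
from urllib.parse import quote

def github_raw_url(user, repo, branch, path):
    """Build a raw.githubusercontent.com link (hypothetical helper)."""
    # quote() percent-encodes characters such as spaces ('Data Sets'
    # becomes 'Data%20Sets') but leaves the '/' separators unchanged
    return f"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{quote(path)}"

url = github_raw_url(
    "mentaltraveller",
    "Data-Science",
    "main",
    "Data Sets/housing_in_london_yearly_variables.csv",
)
print(url)
```

The resulting `url` can be passed to `pd.read_csv` in the same way as in the section 'From a csv file hosted on Github.com'.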

### Note:
For future data set uploads, you only need to navigate to the Data Sets folder in your Github repository and click on Add File there.  The file you upload will then sit in the Data Sets folder.


In [None]:
import pandas as pd
url = 'https://raw.githubusercontent.com/mentaltraveller/Data-Science/main/Data%20Sets/housing_in_london_yearly_variables.csv'
df = pd.read_csv(url)
df.head(10)

Unnamed: 0,code,area,date,median_salary,life_satisfaction,mean_salary,recycling_pct,population_size,number_of_jobs,area_size,no_of_houses,borough_flag
0,E09000001,city of london,1999-12-01,33020.0,,48922,0,6581.0,,,,1
1,E09000002,barking and dagenham,1999-12-01,21480.0,,23620,3,162444.0,,,,1
2,E09000003,barnet,1999-12-01,19568.0,,23128,8,313469.0,,,,1
3,E09000004,bexley,1999-12-01,18621.0,,21386,18,217458.0,,,,1
4,E09000005,brent,1999-12-01,18532.0,,20911,6,260317.0,,,,1
5,E09000006,bromley,1999-12-01,16720.0,,21293,13,294902.0,,,,1
6,E09000007,camden,1999-12-01,23677.0,,30249,13,190003.0,,,,1
7,E09000008,croydon,1999-12-01,19563.0,,22205,13,332066.0,,,,1
8,E09000009,ealing,1999-12-01,20580.0,,25046,12,302252.0,,,,1
9,E09000010,enfield,1999-12-01,19289.0,,21006,9,272731.0,,,,1
