# Data structures and data formats
In Python we mainly use dictionaries to store data with a non-trivial structure. A dictionary is a Python data structure: a way of storing data in memory that comes with methods we can use (lookup, enumeration, etc.).

However, when we transfer data we need a way to format (serialize) a structure, usually as simple text, because we will not always be communicating with another Python program. This is where a data format comes in: a well-defined way of formatting data so that we can transmit both structure and values reliably.

Currently the two most popular data formats are probably JSON and XML.

## JSON
JSON stands for JavaScript Object Notation. It is an open standard file format and, despite its name, is independent of JavaScript. From a Python programmer's perspective JSON is a very friendly format because it looks almost exactly like Python dictionaries and lists (the visible differences are small, for example true/false/null instead of True/False/None). Let's look at a simple example, copied straight from Wikipedia: https://en.wikipedia.org/wiki/JSON.

In [4]:
jsonString = '''
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    },
    {
      "type": "mobile",
      "number": "123 456-7890"
    }
  ],
  "children": [],
  "spouse": null
}
'''

Now jsonString is a Python string variable that contains JSON-formatted text.

In [5]:
import json
jsonDict = json.loads(jsonString)
print(type(jsonString))
print(type(jsonDict))
print(jsonDict)

<class 'str'>
<class 'dict'>
{'firstName': 'John', 'lastName': 'Smith', 'isAlive': True, 'age': 27, 'address': {'streetAddress': '21 2nd Street', 'city': 'New York', 'state': 'NY', 'postalCode': '10021-3100'}, 'phoneNumbers': [{'type': 'home', 'number': '212 555-1234'}, {'type': 'office', 'number': '646 555-4567'}, {'type': 'mobile', 'number': '123 456-7890'}], 'children': [], 'spouse': None}


As we can see in the example above (in phoneNumbers), JSON also supports lists (arrays). Let's check how they look after parsing.

In [3]:
print(type(jsonDict["phoneNumbers"]))
print(jsonDict["phoneNumbers"])

<class 'list'>
[{'type': 'home', 'number': '212 555-1234'}, {'type': 'office', 'number': '646 555-4567'}, {'type': 'mobile', 'number': '123 456-7890'}]
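The mapping between JSON values and Python types can be checked directly. The snippet below is a small sketch that parses one value of each JSON kind; the key names in the sample string are made up for illustration.

```python
import json

# One JSON value of each kind; the key names are arbitrary.
sample = '{"obj": {"k": 1}, "arr": [1, 2], "text": "hi", "int": 27, "real": 2.5, "yes": true, "nothing": null}'
parsed = json.loads(sample)

# Print the Python type that each JSON value was converted to.
for key, value in parsed.items():
    print(key, type(value).__name__)
```

Objects become dict, arrays become list, and true, false and null become True, False and None.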


We can reverse the procedure and go from a Python dictionary to a JSON string using json.dumps. We can even set the indent level to make the output more human readable.

In [4]:
print(type(json.dumps(jsonDict)))
print(json.dumps(jsonDict, indent=1))

<class 'str'>
{
 "firstName": "John",
 "lastName": "Smith",
 "isAlive": true,
 "age": 27,
 "address": {
  "streetAddress": "21 2nd Street",
  "city": "New York",
  "state": "NY",
  "postalCode": "10021-3100"
 },
 "phoneNumbers": [
  {
   "type": "home",
   "number": "212 555-1234"
  },
  {
   "type": "office",
   "number": "646 555-4567"
  },
  {
   "type": "mobile",
   "number": "123 456-7890"
  }
 ],
 "children": [],
 "spouse": null
}
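As a quick check that serialization really is reversible, the sketch below round-trips a small dictionary through a string and through a file. The record and the file name person.json are made up for this example.

```python
import json
import os
import tempfile

# A small record invented for this example.
record = {"firstName": "John", "age": 27, "children": [], "spouse": None}

# String round trip: dumps followed by loads gives back an equal dictionary.
round_tripped = json.loads(json.dumps(record))

# File round trip; json.dump and json.load work on open file objects.
path = os.path.join(tempfile.mkdtemp(), "person.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(record, f, indent=2)
with open(path, encoding="utf-8") as f:
    from_file = json.load(f)

print(round_tripped == record, from_file == record)
```

The round trip is exact here because the record uses only string keys and JSON-friendly values. Non-string keys are converted to strings and tuples come back as lists, so not every Python dictionary survives unchanged.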


## XML
The other format with many applications is XML. JSON and XML share some similarities, but they are very different. The main difference is that JSON is only a data format, whereas XML is a markup language, so it has many more constructs (attributes, namespaces, mixed content and so on) and therefore more possibilities. Those possibilities come at a price: XML documents are typically larger and slower to parse. So when you work on a project where performance is vital, make sure that you really need XML.
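To get a rough feel for the size difference, the sketch below serializes one tiny record both ways and compares the lengths. It is an illustration on a made-up record, not a benchmark, and it says nothing about parsing speed.

```python
import json
import xml.etree.ElementTree as ET

# A tiny record invented for this comparison.
record = {"firstName": "John", "lastName": "Smith", "age": 27}

# Compact JSON without extra whitespace.
json_text = json.dumps(record, separators=(",", ":"))

# The same record as XML: one child element per key under a <person> root.
person = ET.Element("person")
for key, value in record.items():
    ET.SubElement(person, key).text = str(value)
xml_text = ET.tostring(person, encoding="unicode")

print(len(json_text), json_text)
print(len(xml_text), xml_text)
```

The XML version is longer mainly because every element name appears twice, once in the opening tag and once in the closing tag.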

Understanding XML and JSON fully requires a much deeper look. These links are a good way to start if you are interested:
* http://www.yegor256.com/2015/11/16/json-vs-xml.html
* https://stackoverflow.com/questions/4862310/json-and-xml-comparison

Let's see what XML can look like by rewriting our JSON example. XML needs exactly one root element, which can then have as many branches and sub-branches as we want.

In [5]:
xmlString = '''
<person>
    <firstName att1="valueAtt1" at2="anyValue">
    "John"
    </firstName>
    <lastName attr2="anyValue">
    "Smith"
    </lastName>
    <isAlive>
    true
    </isAlive>
    <age>
    27
    </age>
    <address>
        <streetAddress>"21 2nd Street"
        </streetAddress>
        <city>"New York"
        </city>
        <state>"NY"
        </state>
        <postalCode>"10021-3100"
        </postalCode>
    </address>
    <children>
    </children>
    <spouse>
    null
    </spouse>
</person>
'''

Now we can start working with our XML. However, converting it to a Python dictionary is not so easy. It is also not recommended, as we lose some XML features in the process, such as attributes. Python's standard library has an xml package with a module called ElementTree that lets us work with XML.

In [6]:
import xml.etree.ElementTree as ET
root = ET.fromstring(xmlString)

In [7]:
for branch in root:
#     print(branch.keys)
    print(branch.tag, branch.attrib, branch.text)

firstName {'att1': 'valueAtt1', 'at2': 'anyValue'} 
    "John"
    
lastName {'attr2': 'anyValue'} 
    "Smith"
    
isAlive {} 
    true
    
age {} 
    27
    
address {} 
        
children {} 
    
spouse {} 
    null
    


The XML tree structure allows us to navigate a document using XPath, which makes it easy to address parent and child elements. From a data scientist's perspective XML resembles HTML because of its tree-like structure, which is why XPath is also useful in web crawling. We need to remember, however, that HTML and XML are not the same.
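
A minimal sketch of path-based lookups with ElementTree follows. It uses a shortened, cleaned-up version of the person document (an assumption made to keep the example small) and only the limited XPath subset that ElementTree supports.

```python
import xml.etree.ElementTree as ET

# Shortened version of the person document, written for this example.
xml_text = """
<person>
    <firstName att1="valueAtt1">John</firstName>
    <address>
        <city>New York</city>
        <state>NY</state>
    </address>
</person>
"""
root = ET.fromstring(xml_text)

# A path relative to the root: child 'address', then its child 'city'.
city = root.find("address/city").text

# Every element, at any depth, that carries an 'att1' attribute.
with_attr = [el.tag for el in root.findall(".//*[@att1]")]

# All direct children of 'address'.
address_children = [el.tag for el in root.findall("./address/*")]

print(city, with_attr, address_children)
```

ElementTree implements only part of XPath. lxml offers a fuller implementation through the xpath() method on its elements.
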
There are also other XML libraries, such as lxml. If we use its ElementTree-compatible API, the code looks exactly the same.

In [8]:
from lxml import etree
root = etree.fromstring(xmlString)
for branch in root:
#     print(branch.keys)
    print(branch.tag, branch.attrib, branch.text)

firstName {'att1': 'valueAtt1', 'at2': 'anyValue'} 
    "John"
    
lastName {'attr2': 'anyValue'} 
    "Smith"
    
isAlive {} 
    true
    
age {} 
    27
    
address {} 
        
children {} 
    
spouse {} 
    null
    


## Web scraping - API and Requests
The idea of web scraping is simple. We want to automate browsing the internet and gather the data we are interested in, in a structured way. The first element of web scraping is establishing a connection. There are many libraries that can serve this need (curl, urllib, requests, mechanize). For simple procedures requests is a good place to start.

The simplest form of web scraping is making API requests. Let's look at an example that asks the Crossref API about publications by an author (https://github.com/CrossRef/rest-api-doc).

In [1]:
import requests

In [6]:
url = 'https://api.crossref.org/works?'
headers = {'User-Agent': 'ScientificReserach/1.0 (http://coin.wne.uw.edu.pl/mwilamowski; mailto:mwilamowski@wne.uw.edu.pl) '}
query = 'query=Mikolaj+Czajkowski'
print(query)
r = requests.get(url+query, headers=headers)
pubs = json.loads(r.text)  # r.json() would give the same result
pubs

query=Mikolaj+Czajkowski


{'status': 'ok',
 'message-type': 'work-list',
 'message-version': '1.0.0',
 'message': {'facets': {},
  'total-results': 4892,
  'items': [{'indexed': {'date-parts': [[2024, 12, 12]],
     'date-time': '2024-12-12T00:10:04Z',
     'timestamp': 1733962204534,
     'version': '3.30.2'},
    'posted': {'date-parts': [[2024]]},
    'group-title': 'SSRN',
    'reference-count': 68,
    'publisher': 'Elsevier BV',
    'content-domain': {'domain': [], 'crossmark-restriction': False},
    'DOI': '10.2139/ssrn.4903251',
    'type': 'posted-content',
    'created': {'date-parts': [[2024, 7, 23]],
     'date-time': '2024-07-23T18:33:42Z',
     'timestamp': 1721759622000},
    'source': 'Crossref',
    'is-referenced-by-count': 0,
    'title': ['The Effects of Emotions on Stated Preferences for Environmental Change: A Re-Examination'],
    'prefix': '10.2139',
    'author': [{'given': 'YiLong',
      'family': 'Xu',
      'sequence': 'first',
      'affiliation': []},
     {'given': 'Mikolaj',
  

In [None]:
pubs["message"]["items"][1]
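The sketch below shows two small refinements. The query string is built from a dictionary instead of by hand, and a helper pulls DOIs and titles out of the nested response. The helper name summarize_items and the hand-made sample_payload are assumptions for illustration; the sample mimics only the fields visible in the output above, so the code runs without network access.

```python
from urllib.parse import urlencode

# Build the query string from a dictionary; urlencode takes care of escaping.
base_url = "https://api.crossref.org/works"
query_url = base_url + "?" + urlencode({"query": "Mikolaj Czajkowski"})


def summarize_items(payload):
    """Return (DOI, first title) pairs from a Crossref-style work list."""
    items = payload.get("message", {}).get("items", [])
    return [(item.get("DOI"), (item.get("title") or [None])[0]) for item in items]


# Hand-made stand-in for the parsed response, trimmed to the fields used here.
sample_payload = {
    "status": "ok",
    "message": {
        "items": [
            {
                "DOI": "10.2139/ssrn.4903251",
                "title": ["The Effects of Emotions on Stated Preferences for Environmental Change: A Re-Examination"],
            }
        ]
    },
}

print(query_url)
print(summarize_items(sample_payload))
```

With requests the same query can be passed as requests.get(url, params={"query": "Mikolaj Czajkowski"}), and summarize_items(pubs) can then be applied to the real response.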

Let's say we want to scrape the results of all runners in a Dutch running event, the 2016 Enschede Marathon (http://evenementen.uitslagen.nl/2016/enschedemarathon/uitslag.php?on=1&p=1&tl=nl).

In [8]:
r = requests.get('http://evenementen.uitslagen.nl/2016/enschedemarathon/uitslag.php?on=1&p=1&tl=nl')

In [None]:
r.text

Now we have the source code of this page, but it is not very friendly or human readable. There is no web scraping without some knowledge of HTML. It looks, however, as if there is a nice HTML table with the results.

There is a very good library, BeautifulSoup, that allows us to parse HTML. We can use it to find all the tables in the HTML and then look for the table that we are interested in.

In [10]:
from bs4 import BeautifulSoup
bs = BeautifulSoup(r.text, "lxml")  # name the parser explicitly to avoid a warning
table = bs.find_all(lambda tag: tag.name == 'table' and tag.has_attr('class'))

It looks as if the table at index 1 is the one we are looking for.

In [None]:
table[1]

That leaves another problem to solve. We now have the table we were looking for, but it is HTML and not JSON, CSV or anything else directly useful. Thankfully this is easy in Python.

In [11]:
from io import StringIO
import pandas as pd
# Wrap the HTML in StringIO; recent pandas versions warn when given a literal HTML string
pd.read_html(StringIO(str(table[1])), header=0)[0].head()


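| Unnamed: 0 | Plts | StNr | Naam | Woonplaats | Land | Uitsl | Categ | Bruto | Netto | Unnamed: 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Sarah Jebet | Kenia | | 1e | Vsen | 2:27:59 | 2:27:59 | |
| 1 | 2 | 4 | Priscah Jepleting | Kenia | | 2e | Vsen | 2:29:08 | 2:29:06 | |
| 2 | 3 | 6 | Rose Jepchumba | Kenia | | 3e | Vsen | 2:29:09 | 2:29:09 | |
| 3 | 4 | 393 | David Stevens | Roosdaal | | 1e | Msen | 2:31:34 | 2:31:33 | |
| 4 | 5 | 5 | Eunice Jeptoo | Kenia | | 4e | Vsen | 2:32:36 | 2:32:35 | |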


That was easy. All we have to do now is walk through all the pages of the website in a loop, changing only the page number in the link.

In [12]:
for k in range(1,5):
    print('http://evenementen.uitslagen.nl/2016/enschedemarathon/uitslag.php?on=1&p={0}&tl=nl'.format(k))

http://evenementen.uitslagen.nl/2016/enschedemarathon/uitslag.php?on=1&p=1&tl=nl
http://evenementen.uitslagen.nl/2016/enschedemarathon/uitslag.php?on=1&p=2&tl=nl
http://evenementen.uitslagen.nl/2016/enschedemarathon/uitslag.php?on=1&p=3&tl=nl
http://evenementen.uitslagen.nl/2016/enschedemarathon/uitslag.php?on=1&p=4&tl=nl
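One way to turn this loop into a small collector is sketched below. The function collect_pages and the stand-in fake_fetch are hypothetical names introduced for this example. The download step is passed in as a function so that the sketch runs without network access.

```python
from time import sleep

URL_TEMPLATE = "http://evenementen.uitslagen.nl/2016/enschedemarathon/uitslag.php?on=1&p={0}&tl=nl"


def collect_pages(fetch, first=1, last=4, delay=0.0):
    """Call fetch(url) for each page number and return the results in order."""
    results = []
    for k in range(first, last + 1):
        results.append(fetch(URL_TEMPLATE.format(k)))
        sleep(delay)  # pause between requests to be gentle on the server
    return results


# Offline stand-in for a real download.
def fake_fetch(url):
    return "page for " + url


pages = collect_pages(fake_fetch)
print(len(pages))
```

In real use fetch could be something like lambda url: requests.get(url).text. Each page can then be parsed with pd.read_html as shown earlier, and the resulting tables combined with pd.concat.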


What about a different running event? Let's look at the Ottawa half-marathon (https://www.sportstats.ca/display-results.xhtml?raceid=26001).

In [None]:
r = requests.get('https://www.sportstats.ca/display-results.xhtml?raceid=26001')
r.text

## Web scraping - Selenium webdriver
Again, everything looks fine until we try to get to the next page. The site uses JavaScript to load the results for the next page, so we cannot reach them using requests alone. What can we do?

This is where Selenium comes in. Selenium is a library for automated testing and crawling. It can control a browser as long as a proper driver for it is available. For Firefox we need geckodriver (https://github.com/mozilla/geckodriver/releases). Make sure that your driver version matches your browser.

For Chrome we need chromedriver, available from https://googlechromelabs.github.io/chrome-for-testing/

In [17]:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from time import sleep

In [20]:
#driver = webdriver.Firefox(r"C:\Users\mwilamowski\Desktop\enPython", timeout=1)
# driver.get('https://www.sportstats.ca/display-results.xhtml?raceid=26001')
# sleep(1)

In [19]:
# On Windows, you should also add it to the system PATH variable, for example using 'setx PATH "%PATH%;C:\path\to\chromedriver"'
#service = Service("C:\\WebDriver\\chromedriver.exe") 

service = Service("/usr/local/bin/chromedriver")
driver = webdriver.Chrome(service=service)

In [20]:
driver.get('https://www.sportstats.ca/display-results.xhtml?raceid=26001')
sleep(1)

Now we can look at the page source. It looks as if there is just one element on the page that uses the fa-angle-right class, and it is shown as the next-page arrow. We can use it to click through to the next page.

In [None]:
nextPage = driver.find_element(By.CLASS_NAME, "fa-angle-right")
nextPage.click()

The problem with this solution is that the "fa-angle-right" arrow appears even on the last page. So we should check the current page number and compare it with the previous one, and keep going only as long as the number keeps going up.

In [23]:
pageSource = driver.page_source
bs = BeautifulSoup(pageSource, "lxml")
currPage = int(bs.find_all("li", {"class": "active"})[0].a.contents[0])
print(currPage)

1
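The stopping rule itself can be tried out without a browser. The sketch below is a simulation under stated assumptions: walk_pages and FakeSite are hypothetical names, and FakeSite imitates a three-page site whose next arrow does nothing on the last page.

```python
def walk_pages(get_page_number, click_next, max_pages=1000):
    """Visit pages until the page number stops increasing; return the numbers seen."""
    visited = []
    prev_page = 0
    while len(visited) < max_pages:
        curr_page = get_page_number()
        if curr_page <= prev_page:
            break  # clicking 'next' did not move us forward: last page reached
        visited.append(curr_page)
        prev_page = curr_page
        click_next()
    return visited


class FakeSite:
    """Three-page site whose 'next' arrow does nothing on the last page."""

    def __init__(self, last_page=3):
        self.page = 1
        self.last_page = last_page

    def current(self):
        return self.page

    def next(self):
        self.page = min(self.page + 1, self.last_page)


site = FakeSite()
visited = walk_pages(site.current, site.next)
print(visited)
```

With Selenium, get_page_number would read the active page number from the page source and click_next would click the arrow element. The max_pages cap is an extra safety limit added in this sketch.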


Putting these pieces together, we can wrap the steps in a while loop that stops when the page number no longer increases or the arrow cannot be found:

In [24]:
from selenium.common.exceptions import NoSuchElementException

def readPages(bs, pages):
    # Placeholder helper (an assumption, not defined earlier in this notebook):
    # keep the first table that has a class attribute for later parsing.
    pages.append(bs.find(lambda tag: tag.name == 'table' and tag.has_attr('class')))

pages = []
pageCount = 0
prevPage = 0
while True:
    bs = BeautifulSoup(driver.page_source, "lxml")
    try:
        currPage = int(bs.find_all("li", {"class": "active"})[0].a.contents[0])
        print(currPage)
    except (IndexError, AttributeError, ValueError):
        print("page number not found")
        break
    if currPage <= prevPage:
        break  # the page number stopped increasing: we are on the last page
    readPages(bs, pages)
    try:
        nextPage = driver.find_element(By.CLASS_NAME, "fa-angle-right")
    except NoSuchElementException:
        break
    nextPage.click()
    sleep(7)  # give the JavaScript time to load the next page
    pageCount += 1
    prevPage = currPage

With a basic knowledge of requests and Selenium we can scrape almost anything on the web.