# Data Science

## Notebook 2 (APIs, web-scraping, classification, evaluation)

### Collecting data

* Downloading data files directly
* Scraping data from a web page
* Extracting data from a web API

### API (Application Programming Interface)

* lets applications/websites communicate with each other
* similar to a UI (User Interface) but designed for use by other applications rather than humans

How it works:
1. the client initiates an API call, or "request", to retrieve information from the API
2. after receiving a valid request, the API makes a call to the webserver for that information
3. the server sends a response to the API with the requested information
4. the API transfers the data to the client application that requested the information


* we can extract data from a web API too (sometimes you need to get an API key for authentication)
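The request step above can be sketched with the standard library alone. This is a minimal illustration that only builds the request and does not send it; the endpoint `api.example.com`, the parameter names and the placeholder key are assumptions made up for the example, not a real service.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and placeholder key, used only for illustration.
BASE_URL = "https://api.example.com/v1/articles"
API_KEY = "YOUR_API_KEY"

def build_request(year, month):
    """Build (but do not send) a GET request that carries the API key as a query parameter."""
    query = urlencode({"year": year, "month": month, "api-key": API_KEY})
    return Request(f"{BASE_URL}?{query}", method="GET")

req = build_request(2019, 1)
print(req.get_method())
print(req.full_url)
```

Keeping the key in one variable (or reading it from an environment variable) makes it easier to avoid publishing it together with the notebook.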

#### Common Examples of APIs

* universal logins (e.g. "login with Google")

<span style="color:red">Other examples?</span>

In [None]:
import requests

api_key = 'J0CjIDeC0ArC2OdneTWp210aqwgHBhXc'
url = 'https://api.nytimes.com/svc/archive/v1/2019/1.json?api-key=' + api_key
nyt = requests.get(url)
print(nyt.status_code)

#### Most frequently used file formats
* CSV (comma-separated values)
* JSON (JavaScript Object Notation):
    * A text format that is independent of particular programming languages
    * Most web APIs return data in JSON format
* HTML
    * hierarchical structure (tags)
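To make the difference between the first two formats concrete, here is a small sketch that parses the same made-up records from a CSV string and from a JSON string. The names and values are invented for the example.

```python
import csv
import io
import json

# The same two made-up records in both formats.
csv_text = "name,age\nAnna,31\nBen,27\n"
json_text = '[{"name": "Anna", "age": 31}, {"name": "Ben", "age": 27}]'

# CSV values always come back as strings; types must be converted by hand.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON keeps numbers, booleans, null, lists and nested objects.
json_rows = json.loads(json_text)

print(csv_rows[0])
print(json_rows[0])
```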

##### Example - a possible JSON representation of a person

In [None]:
json_ex = {
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": True,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    },
    {
      "type": "mobile",
      "number": "123 456-7890"
    }
  ],
  "children": [],
  "spouse": None
}

In [None]:
type(json_ex)
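Since the parsed object is an ordinary Python dictionary, nested values can be reached by chaining keys and list indices. The sketch below uses a shortened copy of the person record and also shows how Python's `True` and `None` are written when the dictionary is serialized back to JSON text.

```python
import json

# A shortened version of the person record, for illustration.
person = {
    "firstName": "John",
    "isAlive": True,
    "address": {"city": "New York"},
    "phoneNumbers": [{"type": "home", "number": "212 555-1234"}],
    "spouse": None,
}

city = person["address"]["city"]                    # nested dictionary
first_number = person["phoneNumbers"][0]["number"]  # list of dictionaries
as_json = json.dumps(person)                        # True -> true, None -> null

print(city, first_number)
print(as_json)
```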

### Importing JSON files

In [None]:
import json

a = {'a': 1, 'b':2}
s = json.dumps(a)
a2 = json.loads(s)

In [None]:
a

In [None]:
s

In [None]:
type(s)

In [None]:
a2

In [None]:
type(a2)

<span style="color:red">Where does this file get stored?</span>

In [None]:
with open('../Data/data.txt', 'w') as outfile:
    json.dump(a, outfile)
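A relative path such as `'../Data/data.txt'` is interpreted relative to the current working directory of the notebook. The sketch below shows one way to check where such a path points, and round-trips a dictionary through a file in a temporary directory so that nothing is left on disk; the temporary directory is an assumption of the example, not part of the original workflow.

```python
import json
import tempfile
from pathlib import Path

data = {"a": 1, "b": 2}

# Write and read back inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "data.txt"
    with open(out_path, "w") as outfile:
        json.dump(data, outfile)
    with open(out_path) as infile:
        restored = json.load(infile)

# Relative paths are resolved against the current working directory.
resolved = Path("../Data/data.txt").resolve()
print(resolved)
```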

#### Importing data about the Hungarian National Football team at the 2020 European Football (Soccer) Championship

In [None]:
with open('../Data/hungary-euro2020.json','r') as f:
    he2020=json.load(f)
he2020

In [None]:
import pandas as pd

pd.DataFrame(he2020).head()

In [None]:
players_list = he2020['Players']
players_list
print(type(players_list))
print(type(players_list[0]))
players_list[0]

##### Collecting the important information

Filtering important attributes

In [None]:
keys = ['pFName', 'Minutes Played', 'Goals', 'BirthDate', 'Club']

In [None]:
# extracting information about players
players_dat = []
for player in players_list:
    player_dat = []
    for key in keys:
        player_dat.append(player[key])
    players_dat.append(player_dat)

In [None]:
# same as above but with list comprehension 
players_dat = [[player[i] for i in keys] for player in players_list] 
players_dat[0:5]

From list to Pandas data frame

In [None]:
players_df = pd.DataFrame(players_dat)
players_df.columns = ['Full name', 'Minutes played', 'Goals', 'Date of birth', 'Club']
players_df.head()
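As an alternative sketch, pandas can build the data frame directly from a list of dictionaries, after which the columns of interest are selected and renamed. The player records below are made up for illustration and are not taken from the actual file.

```python
import pandas as pd

# Made-up records, for illustration only.
players = [
    {"pFName": "Player A", "Goals": 1, "Club": "Club X", "Height": 180},
    {"pFName": "Player B", "Goals": 0, "Club": "Club Y", "Height": 175},
]
keys = ["pFName", "Goals", "Club"]

# Each dictionary becomes a row; keep only the wanted columns and rename one.
df = pd.DataFrame(players)[keys].rename(columns={"pFName": "Full name"})
print(df)
```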

## Web Scraping

- Urllib: the most basic webscraping package 
- BeautifulSoup: Lightning fast static web page processing
- Selenium: To manipulate, test and scrape dynamic HTML web pages. It is able to use the elements of the user interface, for example fill in forms, check the checkboxes and so on...

**Strategy:** 
1. Download the page source with Selenium or Urllib
2. Turn the source into a BeautifulSoup object to search the HTML with BeautifulSoup functions.

**Be careful and fair (copyright, license, media law, server overloading, etc.)!**
- the robots.txt file, generally located at the root of a website, tells web crawlers which parts of the site are off limits to scrape, which helps the site owner protect their intellectual property
- it is not a legal requirement but a widely used convention
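Python's standard library can read robots.txt rules. The sketch below parses a small, made-up robots.txt from a string (so no network access is needed) and asks whether two hypothetical URLs may be fetched.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, for illustration only.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical URLs on a placeholder domain.
allowed_public = parser.can_fetch("*", "https://example.com/public/page.html")
allowed_private = parser.can_fetch("*", "https://example.com/private/page.html")

print(allowed_public, allowed_private)
```

For a real site, `set_url()` followed by `read()` would download the file instead of parsing a string.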

For web scraping we should have a basic understanding of HTML tags!

## HTML, just the basics 

### Tags, Elements, Attributes

Tag: < something between angle brackets > <br>
Some important tags:
- `<div>`: sections
- `<h1>` - `<h6>`: headings
- `<p>`, `<pre>`: paragraphs and preformatted text
- `<ul>`, `<ol>`, `<li>`: lists
- `<a>`: links
- `<img>`: images
- `<table>`: tables
- `<tr>`: rows within a table

Element: it begins with an opening tag, ends with a closing tag, and includes the content between them.
For example:

<html>

    <body>
    
        <h1> Fake header on a fake website </h1>
    
        <p style='color:red'> This thing with its tags is an element. </p>
        
    </body>
    
</html>


Attributes: contain additional information about the elements (in general, they appear in the opening tag). For example:


- `<img src="picture.png">`: src is an attribute with the value "picture.png"
- `<p style='color:red'> This is a red paragraph. </p>`: style is an attribute whose value sets the text color to red
- `<a href="https://www.w3schools.com"> Visit W3Schools </a>`: href is an attribute holding the link target

Learn more about HTML

https://www.w3schools.com/html/html_basic.asp

#### Most of the time when we scrape HTML pages we locate the information by its tags and attributes. The most important attributes for this purpose are id and class.
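As a minimal sketch of locating elements by `id` and `class`, the example below parses a small made-up HTML snippet with BeautifulSoup. The ids, class names and texts are invented for the illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML, for illustration only.
html = """
<div id="main">
  <p class="intro">Welcome</p>
  <p class="note">First note</p>
  <p class="note">Second note</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

main_div = soup.find(id="main")                              # an id should be unique on a page
notes = [p.text for p in soup.find_all("p", class_="note")]  # a class can be shared by many elements

print(main_div.name)
print(notes)
```

The keyword is `class_` with a trailing underscore because `class` is a reserved word in Python.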

### DOM tree (Document Object Model)

<img src='DOM_tree_0.jpg'>
<img src='DOM_tree_1.jpg'>

#### The inner HTML of an element has the same structure as the whole HTML page. We can use the same methods to locate information inside, for example, a div as we do in the full document.
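The following sketch illustrates this idea with BeautifulSoup on a made-up snippet: the same `find_all` call is used once on the whole document and once on a single div, which restricts the search to that subtree.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two sections, for illustration only.
html = """
<div id="menu"><a href="/home">Home</a><a href="/about">About</a></div>
<div id="footer"><a href="/contact">Contact</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching the whole document finds every link.
all_links = [a.get("href") for a in soup.find_all("a")]

# Searching inside one element only finds the links in that subtree.
menu = soup.find("div", {"id": "menu"})
menu_links = [a.get("href") for a in menu.find_all("a")]

print(all_links)
print(menu_links)
```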

### Scraping static webpages using Beautifulsoup and Urllib

**Reading an entire webpage as a string using Urllib**

In [1]:
from IPython.display import Image, display
from IPython.display import HTML

# the website we will scrape first
HTML('<iframe src=https://www.crummy.com/software/BeautifulSoup/ width=1000 height=400></iframe>')



In [2]:
from urllib.request import urlopen

url = 'https://www.crummy.com/software/BeautifulSoup/'
source = urlopen(url).read().decode('utf-8') # read in html and decode bytes into string
print(source)

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/transitional.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Beautiful Soup: We called him Tortoise because he taught us.</title>
<link rev="made" href="mailto:leonardr@segfault.org">
<link rel="stylesheet" type="text/css" href="/nb/themes/Default/nb.css">
<meta name="Description" content="Beautiful Soup: a library designed for screen-scraping HTML and XML.">
<meta name="generator" content="Markov Approximation 1.4 (module: leonardr)">
<meta name="author" content="Leonard Richardson">
</head>
<body bgcolor="white" text="black" link="blue" vlink="660066" alink="red">
<style>
#tidelift { }

#tidelift a {
 border: 1px solid #666666;
 margin-left: auto;
 padding: 10px;
 text-decoration: none;
}

#tidelift .cta {
 background: url("tidelift.svg") no-repeat;
 padding-left: 30px;
}
</style>		   

<img align="right" src="10.1.jpg" width="250"><br />

<p>[

The scraped contents of the URL are just a string, so you can use all of the Python string manipulation methods you're familiar with, like find(), count() and replace().

In [None]:
## count occurrences of 'Soup'
print(type(source))
print(source.count('Soup'))

In [None]:
## find index of 'Reddit uses Beautiful Soup'
position =  source.find('Reddit uses Beautiful Soup')
print(position)

In [None]:
# test to see the substring
print(source[position:position + len('Reddit uses Beautiful Soup')])

In [None]:
import bs4 #this is beautiful soup

soup = bs4.BeautifulSoup(source, 'html.parser') # naming the parser explicitly avoids a warning and keeps results consistent
print(soup)

In [None]:
print(type(soup))

In [None]:
print(soup.prettify()) # prettify() method to display the HTML in a readable format

In [None]:
header = soup.find_all('title') # or .find() to just get the first instance of the requested tag
header[0]

In [None]:
header[0].text

<span style="color:red">How would I extract just the website title?</span>

In [None]:
# find all links
soup.find_all('a')

In [None]:
links = soup.find_all('a')
a_link = links[10]
print(a_link)
a_link.get('href') # search within one tag for a specific attribute using .get()

In [None]:
# store all links on the page in a list
link_list = [l.get('href') for l in soup.find_all('a')]
link_list

In [None]:
# extract all external links
external_links = []

# the loop filters out "None" and links that don't start with http
for l in link_list:
    if l is not None and l[:4] == 'http':
        external_links.append(l)
        
external_links

In [None]:
# the same thing using list comprehension

[l for l in link_list if l is not None and l.startswith('http')]
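Links that do not start with `http` are often relative links to other pages on the same site. As a sketch, `urljoin` from the standard library turns them into absolute URLs; the `href` values below are made-up examples rather than the links actually found on the page.

```python
from urllib.parse import urljoin

base_url = "https://www.crummy.com/software/BeautifulSoup/"

# Hypothetical href values of the kind a page might contain.
hrefs = ["bs4/doc/", "/self/contact.html", "https://www.python.org/", None, "#Download"]

# Resolve every non-missing link against the page it was found on.
absolute = [urljoin(base_url, h) for h in hrefs if h is not None]

for link in absolute:
    print(link)
```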

### Scraping dynamic webpages using Selenium

In [None]:
!pip3 install selenium
!pip3 install webdriver-manager

In [None]:
%%capture --no-display
HTML('<iframe src=https://www.worldometers.info/coronavirus/weekly-trends/ width=1000 height=400></iframe>')

What if we need information on the weekly case/death change? Our robot must click on the Columns dropdown menu and select the required columns. For this we need another package, Selenium.

In [6]:
import selenium
from selenium import webdriver
from bs4 import BeautifulSoup
import time

1. Open the website

In [1]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://www.worldometers.info/coronavirus/weekly-trends/')

If this doesn't work for you, here's another way to open the website. You'll first have to install a web driver, for example ChromeDriver for Google Chrome:

https://chromedriver.chromium.org/downloads 

Make sure to download the version that matches the current version of your Chrome browser! Store the webdriver executable (a .exe file on Windows) somewhere you can access it with a relative file path.

In [None]:
options = webdriver.ChromeOptions()
s = Service('../Data/chromedriver/chromedriver.exe') # chromedriver is an .exe file, depending on your computer you may have to specify this file extension
driver = webdriver.Chrome(service=s, options=options)
driver.get('https://www.worldometers.info/coronavirus/weekly-trends/')

2. Select the dropdown menu

In [4]:
from selenium.webdriver.common.by import By

element = driver.find_element(By.CLASS_NAME,'dropdown-toggle')
print(element.text) # '.text' returns the element's visible text, so check that it is the element you expect

Columns


3. Click on the dropdown menu

In [7]:
element = driver.find_element(By.CLASS_NAME,'dropdown-toggle')

time.sleep(2) # pause to ensure that the website has fully opened, sometimes selenium is too fast
element.click() # element must be visible in the simulated web browser in order to click on it
time.sleep(2)

element = driver.find_element(By.ID,'colsDrop')
element.text

ElementClickInterceptedException: Message: element click intercepted: Element <button title="Click to hide/show columns" style="margin-top:-2px;font-size:14px;" class="btn btn-sm btn-secondary dropdown-toggle" type="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">...</button> is not clickable at point (357, 17). Other element would receive the click: <iframe id="aswift_7" name="aswift_7" style="width: 100vw !important; height: 100vh !important; inset: 0px auto auto 0px !important; position: absolute !important; clear: none !important; display: inline !important; float: none !important; margin: 0px !important; max-height: none !important; max-width: none !important; opacity: 1 !important; overflow: visible !important; padding: 0px !important; vertical-align: baseline !important; visibility: visible !important; z-index: auto !important;" sandbox="allow-forms allow-popups allow-popups-to-escape-sandbox allow-same-origin allow-scripts allow-top-navigation-by-user-activation" width="" height="" frameborder="0" marginwidth="0" marginheight="0" vspace="0" hspace="0" allowtransparency="true" scrolling="no" src="https://googleads.g.doubleclick.net/pagead/html/r20240306/r20110914/zrt_lookup_fy2021.html#RS-0-&amp;adk=1812271808&amp;client=ca-pub-3701697624350410&amp;fa=8&amp;ifi=8&amp;uci=a!8" data-google-container-id="a!8" data-google-query-id="CIif3N-g54QDFSJfHgIdriQI4g" data-load-complete="true"></iframe>
  (Session info: chrome=122.0.6261.111)

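The `ElementClickInterceptedException` above reports that another element (here an advertisement iframe) would have received the click. One possible workaround, sketched below, is to trigger the click through JavaScript, as the checkbox step of this notebook does. The helper name `js_click` is made up for this example, and whether it is enough depends on the page.

```python
def js_click(driver, element):
    """Scroll the element into view, then click it through JavaScript.

    `driver` is expected to be a Selenium WebDriver and `element` a WebElement.
    A JavaScript click is dispatched directly to the element, so an overlay
    sitting on top of it does not intercept the click.
    """
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", element)
    driver.execute_script("arguments[0].click();", element)

# Possible usage with the objects from this notebook:
# js_click(driver, driver.find_element(By.CLASS_NAME, 'dropdown-toggle'))
```

A JavaScript click bypasses the visibility checks that a real user click is subject to, so it is worth confirming afterwards that the menu actually opened.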

In [8]:
element.get_attribute('innerHTML')

'\n                Columns <b class="caret"></b>\n            '

4. Click the checkboxes

In [9]:
element1 = element.find_element(By.ID, 'column_10')
element2 = element.find_element(By.ID, 'column_12')

driver.execute_script("arguments[0].click();", element1)
driver.execute_script("arguments[0].click();", element2)

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="column_10"]"}
  (Session info: chrome=122.0.6261.111); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception


5. Download the source

In [None]:
source = driver.page_source

We need a pandas data frame with the countries and their weekly case change and death change. Let's process the HTML source with BeautifulSoup.

In [None]:
soup = BeautifulSoup(source, 'html.parser') # name the parser explicitly

In [None]:
# arguments: tag, {attribute: value}
table = soup.find('table', {'id':'main_table_countries_today'})
table

In [None]:
rows = table.find_all('tr')
rows[7] # use .text to understand what part of the site it is

In [None]:
countries=[]

for row in rows:
    countries.append(row.find('a', {'class':'mt_a'}))
    
countries

In [None]:
countries=list(filter(lambda x: x is not None, countries)) # drop rows where no country link was found
countries

In [None]:
for row in range(len(countries)):
    countries[row] = countries[row].text
    
countries

In [None]:
list(map(lambda x: x.text, rows[0].find_all('th'))) # map() applies a function to each element of an iterable

Now we know every tag that we are looking for, so we can extract the information from the html source!

In [None]:
rows=table.find_all('tr', {'role':'row'})
countries=[]
weekly_case_change=[]

for row in rows:
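    try:
        country = row.find('a', {'class':'mt_a'}).text
        change = row.find_all('td')[4].text
    except (AttributeError, IndexError):
        continue # skip rows without a country link or with too few cells
    countries.append(country) # append both values together so the lists stay the same length
    weekly_case_change.append(change)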

In [None]:
import pandas as pd

df = pd.DataFrame({'Countries':countries, 'Weekly_Case_Change':weekly_case_change})
df

IMPORTANT: Don't scrape the data from the website over and over (this can overload the server). Instead, save the data to a file on your own computer.

In [None]:
df.to_csv('../Data/covid_scrape.csv')

In [None]:
df = pd.read_csv('../Data/covid_scrape.csv')
df.drop(labels=['Unnamed: 0'],axis=1,inplace=True)
df.head()
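The `Unnamed: 0` column appears because `to_csv` writes the row index as an extra column by default. As a sketch, passing `index=False` when saving avoids the extra column, so nothing has to be dropped after reading. The in-memory buffer and the two made-up rows stand in for the real file and data.

```python
import io
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({"Countries": ["A", "B"], "Weekly_Case_Change": ["+5%", "-3%"]})

buffer = io.StringIO()          # in-memory stand-in for a file on disk
df.to_csv(buffer, index=False)  # do not write the row index as a column
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())
```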

### Classification

#### Importing data

In [None]:
bank_data = pd.read_csv("../Data/bank.csv", delimiter = " ", names = ['age', 'sex', 'region', 'income', 'married', 'children', 'car','save_acct', 'current_acct', 'mortgage', 'pep'])

We have already worked with this data. The attributes are:

* age: age of customer in years (numeric)
* sex: MALE / FEMALE
* region: inner_city/rural/suburban/town
* income: income of customer (numeric)
* married: is the customer married (YES/NO)
* children: number of children (numeric)
* car: does the customer own a car (YES/NO)
* save_acct: does the customer have a saving account (YES/NO)
* current_acct: does the customer have a current account (YES/NO)
* mortgage: does the customer have a mortgage (YES/NO)
* **pep: did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)**

In [None]:
bank_data.head()

In [None]:
numeric_data = bank_data.replace(['NO', 'YES', 'MALE', 'FEMALE'],[0,1,0,1])
numeric_data.head()

### Turn categorical data into numerical data:

Using one-hot encoding with pandas get_dummies():

In [None]:
one_hot = pd.get_dummies(numeric_data['region'])
one_hot = one_hot.replace([True,False],[1,0])
one_hot.head()
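As a small sketch of what one-hot encoding produces, the example below encodes a made-up `region` column. Passing `dtype=int` to `get_dummies` asks for 0/1 integers directly, which is an alternative to replacing `True`/`False` afterwards.

```python
import pandas as pd

# Made-up region values, for illustration only.
regions = pd.Series(["town", "rural", "town", "inner_city"], name="region")

# One column per category; exactly one of them is 1 in every row.
one_hot = pd.get_dummies(regions, dtype=int)
print(one_hot)
```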

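### Unlike pandas merge(), which matches rows on column values by default, join() combines dataframes by index, so be careful not to use it if the row index no longer lines up (for example after filtering, sorting, or resetting the index)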

In [None]:
numeric_data = numeric_data.drop('region', axis = 1)
numeric_data = numeric_data.join(one_hot)
numeric_data.head()
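The following sketch, on made-up data, shows why the index matters for `join()`: once one frame has been filtered and its index reset, the row labels no longer refer to the same rows and the values are silently paired up wrongly.

```python
import pandas as pd

# Made-up data: originally age 30 -> town 1, age 40 -> town 0, age 50 -> town 1.
left = pd.DataFrame({"age": [30, 40, 50]}, index=[0, 1, 2])
right = pd.DataFrame({"town": [1, 0, 1]}, index=[0, 1, 2])

# Matching indices: rows line up as expected.
aligned = left.join(right)

# After filtering and resetting the index, label 0 now means "age 40".
filtered = left[left["age"] > 30].reset_index(drop=True)
misaligned = filtered.join(right)

print(aligned)
print(misaligned)  # age 40 is paired with town 1, which belonged to age 30
```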

#### Introducing scikit-learn

In [None]:
from IPython.display import Image, display, HTML # import from IPython.display; IPython.core.display is deprecated for these names
HTML('<iframe src=http://scikit-learn.org/stable/ width=1000 height=400></iframe>')

Scikit-learn cheat sheet

In [None]:
HTML('<iframe src=https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf width=1000 height=400></iframe>')

Machine learning with scikit-learn:

1. Define the model!
```python
clf = LogisticRegression()
```

2. Fit the model!
```python
clf.fit(X_train, y_train)
```

3. Use the model for prediction!
```python
clf.predict(X_test)
```
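The three steps above can be put together into one runnable sketch. The toy data below (a single feature whose large values belong to class 1) is made up for illustration and is unrelated to the bank dataset.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up toy data: one feature, label is 1 when the feature is large.
X = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression()          # 1. define the model
clf.fit(X_train, y_train)           # 2. fit the model
predictions = clf.predict(X_test)   # 3. use the model for prediction

print(predictions, y_test)
```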

**Preparing the label ("y", "target variable") and the other attributes**

For the bank dataset, our aim is to predict the PEP value.

In [None]:
numeric_data

In [None]:
bank_labels = numeric_data['pep']

In [None]:
bank_attrs = numeric_data.drop('pep', axis=1)

In [None]:
bank_attrs.head()

Import the classifier we will be using, sklearn's metrics module (with helpful functions to evaluate the performance of our model), and sklearn's train_test_split, which randomly splits our data into a train set and a test set.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split

**Splitting the data (features and labels) into train and test data**

In [None]:
bank_features_train, bank_features_test, bank_labels_train, bank_labels_test = train_test_split(bank_attrs, bank_labels, test_size=0.25, random_state=42)

**Initialize and fit the kNN model**

Choosing parameters, such as the number of nearest neighbors

<img src="knn.jpg" width=400 height=400 />

In [None]:
neigh = KNeighborsClassifier(n_neighbors=11, metric="euclidean")
neigh.fit(bank_features_train,bank_labels_train)
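To illustrate what the `n_neighbors` parameter controls, here is a from-scratch sketch of the k-nearest-neighbors idea in plain Python. It is a simplified illustration on made-up 2D points, not how scikit-learn implements the classifier; the function name `knn_predict` is invented for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point, smallest first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up training data: one cluster per class.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (2, 2), k=3))
print(knn_predict(points, labels, (7, 7), k=3))
```

With two classes, an odd `k` avoids tied votes.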

In HW02, you will train a decision tree model. <span style="color:red">What parameters might a decision tree model have?</span> Parameter optimization: choosing the best value for these parameters can be tricky.

**Prediction**

In [None]:
predictions_binary = neigh.predict(bank_features_test)
predictions_binary

In [None]:
bank_labels_test[0:5]

In [None]:
predictions_proba = neigh.predict_proba(bank_features_test)
predictions_proba[0:5]

**Evaluation**

**Confusion Matrix**

In [None]:
import matplotlib.pyplot as plt # needed here because matplotlib is only imported in a later cell

cm = metrics.confusion_matrix(bank_labels_test,predictions_binary)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=neigh.classes_)
disp.plot()
plt.show()

<p style="color:red">Which cell of the confusion matrix represents which term? </p>

**Precision, Recall, Accuracy**

In [None]:
print("Precision: ", metrics.precision_score(bank_labels_test,predictions_binary))
print("Recall: ", metrics.recall_score(bank_labels_test,predictions_binary))
print("Accuracy: ", metrics.accuracy_score(bank_labels_test,predictions_binary))

<p style="color:red">How would you calculate precision, recall and accuracy from the confusion matrix?</p>
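As a worked sketch, the three metrics can be computed by hand from the four cells of a binary confusion matrix. The counts below are made up; the layout follows scikit-learn's convention for labels 0 and 1, where rows are the true classes and columns are the predicted classes.

```python
# Made-up counts in scikit-learn's layout: rows = true class, columns = predicted class.
cm = [[50, 10],   # true 0: true negatives, false positives
      [5, 35]]    # true 1: false negatives, true positives

(tn, fp), (fn, tp) = cm

precision = tp / (tp + fp)                   # of all predicted positives, how many are correct
recall = tp / (tp + fn)                      # of all actual positives, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct

print(precision, recall, accuracy)
```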

**ROC, AUC on a small example**

In [None]:
y_true = [1,0,1,0,0,0,1,0,1,1]
y_score = [0.25,0.43,0.53,0.76,0.85,0.85,0.85,0.87,0.93,0.95]
# y_score = [1,0,1,1,0,0,1,0,1,0]
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score, pos_label=1, sample_weight=None)

In [None]:
tpr

In [None]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline  

plt.figure(figsize=(5,5))
plt.plot(fpr,tpr,linewidth=2.0)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.xlim([0,1])
plt.ylim([0,1])

We can also compute the area under the ROC curve

In [None]:
metrics.roc_auc_score(y_true, y_score)
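The AUC also has a ranking interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes this by hand for the same small example; the result should agree with the value returned by `roc_auc_score`.

```python
y_true = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.25, 0.43, 0.53, 0.76, 0.85, 0.85, 0.85, 0.87, 0.93, 0.95]

pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]

def pair_score(p, n):
    """1 if the positive is ranked above the negative, 0.5 for a tie, 0 otherwise."""
    if p > n:
        return 1.0
    if p == n:
        return 0.5
    return 0.0

# Average over all (positive, negative) pairs.
auc = sum(pair_score(p, n) for p in pos for n in neg) / (len(pos) * len(neg))
print(auc)  # 0.56
```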

**ROC and AUC on Bank dataset**

For the binary class predictions

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(bank_labels_test, predictions_binary, pos_label=1, sample_weight=None)

In [None]:
plt.figure(figsize=(5,5))
plt.plot(fpr,tpr,linewidth=2.0)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.xlim([0,1])
plt.ylim([0,1])

In [None]:
metrics.roc_auc_score(bank_labels_test,predictions_binary, sample_weight=None)

<p style="color:red;">What is the AUC score of a completely random classifier?</p>

For the probability class predictions

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(bank_labels_test, predictions_proba[:,1], pos_label=1, sample_weight=None)

In [None]:
plt.figure(figsize=(5,5))
plt.plot(fpr,tpr,linewidth=2.0)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.xlim([0,1])
plt.ylim([0,1])

In [None]:
metrics.roc_auc_score(bank_labels_test,predictions_proba[:,1], sample_weight=None)

Well, quite bad... :(

Maybe a decision tree classifier will lead us to a better result, try it out! ==> **HW2/2**