# CIS 9
# Data

Data science cannot work without data, so here are some considerations for when we inspect the input data before we use it, and for when we draw conclusions from the data.

### Where to get data

From established government offices, organizations, businesses
<br>Examples:
<br>Government offices
- US Census Bureau
- Centers for Disease Control and Prevention
- US Bureau of Labor Statistics


Worldwide organizations
- World Health Organization
- World Bank Open Data


Local government or organizations
- Santa Clara County Government
- California State University Institutional Research and Analyses


Businesses
- Housing data from Zillow
- Yelp dataset

From the sources above there are several ways to gather data.

From web APIs (Application Programming Interfaces)
- A web API is the way that a company or organization shares its data with web clients.
- An API organizes and packages data so that the data can be easily downloaded and read by the web clients.
- A web API is useful when data changes quickly, such as with stock prices.  
- An API is also useful when a large data set needs to be downloaded, such as map data.
- A web API has a particular URL and several endpoints. 
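
To make the last point concrete, here is a minimal sketch of how one base URL and several endpoints fit together. It uses the Open Notify base URL from the ISS example; the `astros.json` endpoint name, the `ENDPOINTS` dictionary, the `endpoint_url` helper, and the sample response values are assumptions made for illustration, so check the API's own documentation before relying on them.

```python
import json
from urllib.parse import urljoin

# Base URL of the API; each endpoint is a path under it.
BASE_URL = "http://api.open-notify.org/"

# Endpoint names are assumptions for illustration; check the API's documentation.
ENDPOINTS = {
    "iss_position": "iss-now.json",
    "people_in_space": "astros.json",
}

def endpoint_url(name):
    """Build the full URL for one named endpoint."""
    return urljoin(BASE_URL, ENDPOINTS[name])

# A hand-written sample of the JSON text an endpoint might return (values are made up).
sample_response = '{"timestamp": 0, "iss_position": {"latitude": "10.5", "longitude": "-45.2"}}'

data = json.loads(sample_response)      # same kind of result as page.json() on a real response
position = data["iss_position"]
print(endpoint_url("iss_position"))
print("Latitude:", position["latitude"], "Longitude:", position["longitude"])
```

Keeping the endpoints in one dictionary means the rest of the code refers to them by name, so a changed path only needs to be updated in one place.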

[Directory of common and not so common APIs](https://www.programmableweb.com/category/all/apis)


In [None]:
# Example: Use the API of the ISS (International Space Station) to see where the ISS is currently located.

import requests
import time

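# Request the current ISS position; the timeout keeps the call from waiting forever
page = requests.get("http://api.open-notify.org/iss-now.json", timeout=10)
page.raise_for_status()    # stop here if the server returned an HTTP error
data = page.json()         # convert the JSON response into a Python dictionary
print("Current time:", time.ctime(data['timestamp']))
print("Latitude:", data['iss_position']['latitude'], "\nLongitude:", data['iss_position']['longitude'])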

From CSV or Excel files
<br>Instead of an API, some organizations prepare data files that can be downloaded.
<br>pandas can read several file types into a DataFrame, some of which we've already used:
- pd.read_excel(filename)
- pd.read_csv(filename)

But there are other file types that can be read directly into a DataFrame as listed [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).

In [None]:
# Example: read from a csv file online
import pandas as pd

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/yelp.csv'
yelp = pd.read_csv(url)    # yelp data for lab 4
yelp.head()

From webscraping and webcrawling
<br>Some organizations only present their data on their webpage. There is no API or file download. In this case we can _webscrape_ or extract data from the HTML page.
<br>When the data that we extract is a URL, we can use that URL to get to the next webpage. This is called _webcrawling_.

Example: CIS Dept webpage: http://deanza.edu/cis/schedule.html

In [None]:
# Example: extract all CIS classes from the CIS Dept webpage

import requests
from bs4 import BeautifulSoup

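page = requests.get("http://deanza.edu/cis/schedule.html", timeout=10)
page.raise_for_status()                   # stop if the page could not be fetched
soup = BeautifulSoup(page.content, "lxml")
table = soup.find('table')                # first table on the page, or None
if table is None:
    print("No table found on the page")
else:
    # print the text of every link (<a> tag) inside the table
    for tag in table.find_all('a'):
        print(tag.text)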
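
The loop above prints the text of each link. For webcrawling, the link's `href` is the part of interest, because it is the URL of the next page to visit. The following is a minimal sketch that uses only the standard library's `html.parser` (rather than BeautifulSoup) on a made-up HTML snippet. The `LinkCollector` class, the sample HTML, and the `example.com` URL are all illustrative assumptions and do not reflect the structure of the CIS page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page's URL."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Relative links are turned into absolute URLs
                self.links.append(urljoin(self.page_url, href))

# Made-up HTML standing in for a downloaded page.
sample_html = """
<table>
  <tr><td><a href="/courses/a.html">Course A</a></td></tr>
  <tr><td><a href="b.html">Course B</a></td></tr>
</table>
"""

collector = LinkCollector("http://example.com/dept/schedule.html")
collector.feed(sample_html)

# Webscraping gives us the extracted URLs; webcrawling would request each one next.
to_visit = collector.links
print(to_visit)
```

A real crawler would also keep track of the pages it has already visited so that it does not request the same URL twice.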

From surveys
<br>If we have a budget and personnel, we can also conduct surveys of specific groups of people. This way the data is customized for our application.

---

### Data imbalance

Data imbalance occurs when the input dataset is skewed in some way. In a classification problem, it occurs when one label has many more instances than the other labels in the set. It can also show up in the distribution of the input data, where the distribution curve is skewed to one side or has two modes.

A problem with an imbalanced dataset is that the model can learn a shortcut instead of the real pattern and then make poor predictions. For example, if the dataset is 95% type A data and 5% type B data, then the algorithm, which works by estimating probabilities, can simply predict that almost everything is type A. It will be about 95% accurate while rarely, if ever, identifying type B.
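
The 95% figure can be checked with a few lines of Python. This is only a sketch on made-up labels: the "model" here is a stand-in that always predicts the majority label, which is the extreme case of the behavior described above.

```python
# Toy labels: 95 instances of type A and 5 of type B (made-up data).
actual = ["A"] * 95 + ["B"] * 5

# A stand-in "model" that ignores its input and always predicts the majority label.
predicted = ["A"] * len(actual)

# Accuracy is the fraction of predictions that match the actual label.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# How many of the type B instances were found?
b_found = sum(a == "B" and p == "B" for a, p in zip(actual, predicted))

print("Accuracy:", accuracy)
print("Type B found:", b_found, "of", actual.count("B"))
```

The accuracy looks good even though not a single type B instance is identified, which is why accuracy alone can be misleading on imbalanced data.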

For some applications, imbalanced input data is expected. For example, in detecting bank fraud, the majority of the transactions will be legitimate and only a small percentage will be fraudulent.

__How to address data imbalance__

- The obvious way: collect more data, if the data is not expected to be imbalanced.
- Measure performance correctly. In a classification problem, the confusion matrix or the F1 score can give a more meaningful interpretation than the accuracy score.
- If the data is expected to be imbalanced, we can generate synthetic data for the under-represented label to train the model.
- Try different algorithms and compare their results. Some models are better suited for imbalanced data.
- Change the goal of the study slightly. For example, if there are too many legitimate banking transactions compared to the fraudulent ones based on the amount of money being transferred, we can change the study to look at customer behavior instead of the amount of money. If there are 2 transactions from the same credit card that occur in San Jose, USA, and in Barcelona, Spain, at almost the same time, then that could be a good predictor of fraud.
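
As a rough sketch of the second point, the confusion matrix counts and the F1 score can be computed by hand. The helper names `confusion_counts` and `f1_score` and both sets of predictions are made up for illustration; in practice a library such as scikit-learn provides equivalent functions.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one label of interest."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, really positive
        elif p == positive:
            fp += 1          # predicted positive, really negative
        elif a == positive:
            fn += 1          # predicted negative, really positive
        else:
            tn += 1          # predicted negative, really negative
    return tp, fp, fn, tn

def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall; 0.0 if nothing was found."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up results: 95 legitimate and 5 fraudulent transactions.
actual = ["legit"] * 95 + ["fraud"] * 5
always_legit = ["legit"] * 100
# A second hypothetical model that catches 4 of the 5 frauds with 2 false alarms.
better = ["legit"] * 93 + ["fraud"] * 2 + ["fraud"] * 4 + ["legit"] * 1

for name, predicted in [("always_legit", always_legit), ("better", better)]:
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive="fraud")
    accuracy = (tp + tn) / len(actual)
    print(name, "accuracy:", accuracy, "F1:", round(f1_score(tp, fp, fn), 3))
```

The two models have similar accuracy, but the F1 score for the fraud label separates them clearly, because F1 only rewards a model for actually finding the rare label.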

---

### Data bias

In addition to data imbalance, we also need to consider data bias. When the training data comes from a world full of inequalities, the algorithm may learn to propagate those inequalities.

In the criminal justice system, there have been discussions about using ML to predict and set bail, and about how the predicted bail amount differs based on race. If the model incorrectly predicts that people of a particular ethnicity are more likely to commit a crime, their bail amount will be unfairly assessed.

Data bias can also occur when a company uses an ML algorithm to vet the resumes of job applicants. In one instance, a model ended up recommending candidates for interviews based on gender. The conclusion was that historical data from surveying hiring managers showed that more men had worked at the company in the past, so the model learned that men were hired more often and recommended male candidates.
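
The mechanism can be sketched with a deliberately naive model. Everything here is hypothetical: the `history` records are made-up numbers, not real hiring data, and `learn_hire_rates` and `recommend` are illustrative names. The point is only that a model trained on skewed history reproduces the skew.

```python
from collections import Counter

# Made-up historical hiring records as (gender, was_hired). Not real data.
history = (
    [("M", True)] * 40 + [("M", False)] * 20 +
    [("F", True)] * 5 + [("F", False)] * 15
)

def learn_hire_rates(records):
    """'Train' by computing the fraction of past applicants hired, per gender."""
    hired = Counter()
    total = Counter()
    for gender, was_hired in records:
        total[gender] += 1
        hired[gender] += was_hired      # True counts as 1, False as 0
    return {g: hired[g] / total[g] for g in total}

rates = learn_hire_rates(history)

def recommend(gender, threshold=0.5):
    """Recommend an interview when the learned hire rate reaches the threshold."""
    return rates[gender] >= threshold

# Two otherwise identical candidates get different recommendations.
print("Learned hire rates:", rates)
print("Recommend M:", recommend("M"), "| Recommend F:", recommend("F"))
```

Nothing in this model measures a candidate's qualifications; it only repeats the pattern in the historical data, which is how the bias carries forward.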

One of the most commonly discussed examples of data bias is in facial recognition. Depending on how diverse the training dataset is in the age, gender, and race of the people it contains, the model may be able to tell one person from another within a specific group of people, but unable to tell apart different people from another group.

With the advance of NLP, there is also data bias in text. An example is language translation. Google engineers from the Google Translate project have found that when translating from some languages, the output tends to be "He is a doctor" vs. "She is a nurse". To combat this, the engineers have discussed showing both forms: "He is a doctor" and "She is a doctor" are both presented.

[Here](https://sitn.hms.harvard.edu/uncategorized/2020/fairness-machine-learning/) is an article that discusses the problem of data bias and fairness in ML in more depth.

---