# Step 1 - Acquiring Data

In this module, we will start our process to develop a machine learning model. The very first step is to obtain our data, and there are multiple ways to do that, depending on the type of data you're working with. By the end of this module, you will be able to:

* Identify multiple sources of data
* Identify different formats in which data can be found
* Use multiple Python libraries to collect data

In [None]:
# We load our libraries
import sqlite3
import requests
import pandas as pd
from bs4 import BeautifulSoup

## Using traditional data formats

A lot of data is available on open websites, ready to be downloaded. Some of the most common formats in which data can be found are:

* Text files (.txt)
* Comma-Separated Values (.csv)
* Excel (.xls)

Let's load an example of each of them using Python. For this first part, we have previously downloaded 3 datasets publicly available at [data.gov](https://data.gov/).

### Text files

Text files can easily be opened in Python with the built-in `open` function. The following file contains the probability of female death by age and year.

In [None]:
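# The with statement closes the file automatically once the block ends
with open("data/female_death_probabilities.txt", "r") as f:
    text_data = f.read()
text_data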

We have read our file. However, it is not yet a dataset, but a single string containing many lines. Most text files need some extra transformations before they become a functional dataframe. The following code achieves that by:
1. Creating a dictionary that will eventually become our dataset.
2. Splitting the text into lines and reading the column names from the second line.
3. Iterating over the remaining lines (starting from the third line) to extract the values and append them to the dictionary.
4. Creating a pandas dataframe object from the dictionary.

In [None]:
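data = {
    "Year": [],
    "Age": [],
    "Probability_of_Death": []
}
# Read the whole file; the with statement closes it automatically
with open("data/female_death_probabilities.txt", "r") as f:
    text_data = f.read()
lines = text_data.strip().split('\n')
columns = lines[1].split()  # The second line holds the column names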

for line in lines[2:]:
    values = line.split()
    year = values[0]
    probabilities = values[2:]
    
    for age, prob in zip(columns[2:], probabilities):
        data["Year"].append(year)
        data["Age"].append(int(age))
        data["Probability_of_Death"].append(float(prob))

df = pd.DataFrame(data)
df

As we can see, now we have a dataframe we can work with.
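As an alternative to the manual loop, pandas can often parse whitespace-delimited text directly. The following is only a sketch: `sample_text` is a hypothetical stand-in for a wide year-by-age table, and the real file's layout may differ (for example, it may need `skiprows` or have extra leading columns).

```python
import io

import pandas as pd

# Hypothetical sample that mimics a wide year-by-age table.
# The layout of the real file may differ from this one.
sample_text = """Year 0 1 2
1900 0.12 0.03 0.01
1901 0.11 0.02 0.01
"""

# Parse columns separated by any amount of whitespace
wide_df = pd.read_csv(io.StringIO(sample_text), sep=r"\s+")

# Reshape from one column per age to one row per (Year, Age) pair
long_df = wide_df.melt(
    id_vars="Year", var_name="Age", value_name="Probability_of_Death"
)
long_df["Age"] = long_df["Age"].astype(int)
long_df = long_df.sort_values(["Year", "Age"]).reset_index(drop=True)
long_df
```

The reshaping step does the same job as the nested loop above: every age column becomes its own row, so the result has the same three columns as the dataframe we built by hand.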

### CSV Files

A very popular file type, with which you might already be familiar, is the Comma-Separated Values (CSV) format. You have very likely opened a CSV file in an Excel spreadsheet at some point. To read this type of file and load it as a dataframe, we can simply use the *read_csv* function from pandas. The following dataset shows the population of electric vehicles in the state of Washington.

In [None]:
# Loading a csv file example https://catalog.data.gov/dataset/electric-vehicle-population-data
vehicle_df = pd.read_csv("data/Electric_Vehicle_Population_Data.csv")
vehicle_df.head()
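For large CSV files it can help to load only part of the data. The sketch below uses the `usecols`, `dtype` and `nrows` parameters of *read_csv* on a small in-memory sample. The rows and column names are hypothetical and may not match the real electric vehicle dataset.

```python
import io

import pandas as pd

# Hypothetical rows; the column names are illustrative only
sample_csv = io.StringIO(
    "Make,Model,Model Year,Electric Range\n"
    "TESLA,MODEL 3,2020,266\n"
    "NISSAN,LEAF,2018,151\n"
    "CHEVROLET,BOLT EV,2019,238\n"
)

subset_df = pd.read_csv(
    sample_csv,
    usecols=["Make", "Model Year", "Electric Range"],  # Keep only these columns
    dtype={"Make": "category"},  # Store repeated labels compactly
    nrows=2,  # Read only the first two data rows
)
subset_df
```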

## Accessing traditional databases

Sometimes, data lives in a database that follows the [relational database model](https://en.wikipedia.org/wiki/Relational_model). This model underlies most of today's widely used database management systems.

The standard language for querying a relational database and retrieving data from it is the **Structured Query Language (SQL)**. In Python, you can explore SQLite databases and execute SQL queries through the built-in [sqlite3](https://docs.python.org/3/library/sqlite3.html) library. In the following cells, we will execute some queries using the open Chinook database. You can learn more about this dataset at this [link](https://github.com/lerocha/chinook-database).

In [None]:
# First, we establish a connection with the database
conn = sqlite3.connect("data/Chinook_Sqlite.sqlite")

# The next step is to create a cursor object to interact with the database
cursor = conn.cursor()

# The very first query that we will execute shows the tables that are part of our database
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")

# Now, we fetch the results
tables = cursor.fetchall()
tables

As we can see, we have multiple tables in our database. Let's now perform some example queries.

In [None]:
# Retrieve the column information for the Customer table
cursor.execute("PRAGMA table_info(Customer);")
columns = cursor.fetchall()
columns

Let's briefly summarize what each element of each tuple means:

1. Index 0 is the index of the column
2. Index 1 is the name of the column
3. Index 2 is the data type of the column
4. Index 3 is a flag that indicates whether the column has a NOT NULL constraint (1 if NULL values are not allowed, 0 if they are allowed)
5. Index 4 is the default value of the column
6. Index 5 indicates whether the column is part of the primary key (0 if it is not; otherwise its 1-based position within the primary key)
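To make these tuples easier to read, the result can be loaded into a dataframe with named columns. This is a minimal sketch that uses a hypothetical in-memory table rather than the Chinook file. The field names `cid`, `name`, `type`, `notnull`, `dflt_value` and `pk` are the ones SQLite documents for `PRAGMA table_info`.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table standing in for a Chinook table
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE Customer ("
    "CustomerId INTEGER PRIMARY KEY, "
    "FirstName TEXT NOT NULL, "
    "Company TEXT)"
)

# One name per tuple position, in the order returned by PRAGMA table_info
fields = ["cid", "name", "type", "notnull", "dflt_value", "pk"]
schema_rows = demo_conn.execute("PRAGMA table_info(Customer);").fetchall()
schema_df = pd.DataFrame(schema_rows, columns=fields)
demo_conn.close()
schema_df
```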

In [None]:
# Select the first 3 rows of the Customer table
cursor.execute("SELECT * FROM Customer LIMIT 3;")
rows = cursor.fetchall()
rows

In [None]:
# Select the first name and email of the customers whose country is Germany
cursor.execute("SELECT FirstName, Email FROM Customer WHERE Country = 'Germany'")
names = cursor.fetchall()
names
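When the filter value comes from a variable or from user input, it is safer to pass it as a query parameter than to paste it into the SQL string. The sketch below shows the `?` placeholder style of sqlite3 on a hypothetical in-memory table. The customers and the name `target_country` are made up for illustration.

```python
import sqlite3

# Hypothetical in-memory table with made-up customers
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Customer (FirstName TEXT, Email TEXT, Country TEXT)")
demo_conn.executemany(
    "INSERT INTO Customer VALUES (?, ?, ?)",
    [
        ("Anna", "anna@example.com", "Germany"),
        ("Luis", "luis@example.com", "Brazil"),
        ("Karl", "karl@example.com", "Germany"),
    ],
)

# The ? placeholder is filled in by sqlite3, which handles quoting for us
target_country = "Germany"
german_customers = demo_conn.execute(
    "SELECT FirstName, Email FROM Customer WHERE Country = ? ORDER BY FirstName",
    (target_country,),
).fetchall()
demo_conn.close()
german_customers
```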

In [None]:
# Select the 3 largest invoices among those with a total of at least 10
cursor.execute("SELECT * FROM Invoice WHERE Total >= 10 ORDER BY Total DESC LIMIT 3")
invoices = cursor.fetchall()
invoices
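Since the goal of this module is to end up with dataframes, it is worth knowing that pandas can run a query and build the dataframe in one step through `read_sql_query`. The following sketch uses a hypothetical in-memory Invoice table with made-up totals instead of the Chinook file.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table with made-up invoice totals
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Invoice (InvoiceId INTEGER PRIMARY KEY, Total REAL)")
demo_conn.executemany(
    "INSERT INTO Invoice (Total) VALUES (?)",
    [(5.94,), (13.86,), (25.86,), (10.91,), (21.86,)],
)

# Run the query and load the result, including column names, into a dataframe
top_invoices_df = pd.read_sql_query(
    "SELECT * FROM Invoice WHERE Total >= ? ORDER BY Total DESC LIMIT 3",
    demo_conn,
    params=(10,),
)
demo_conn.close()
top_invoices_df
```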

We have performed a few SQL queries to give you a sense of how to retrieve data from a traditional database. However, this is not a SQL-focused module. If you want to learn more SQL, you can follow some of these links:

* [Geeks for Geeks](https://www.geeksforgeeks.org/sql-tutorial/)
* [Khan Academy](https://www.khanacademy.org/computing/computer-programming/sql)
* [W3 Schools](https://www.w3schools.com/sql/)

## Accessing NoSQL databases

Many open databases follow a non-relational, or NoSQL, format. Most of them expose their data in [JSON](https://en.wikipedia.org/wiki/JSON) format, and you can access it through an [API](https://en.wikipedia.org/wiki/API).

Let's look at an example of how to retrieve data through an API. For the following example, we are going to use the OpenWeather platform, which offers up-to-date weather data for multiple locations around the world.

In most cases, you first need to obtain an API key before you can retrieve data, so that the server can recognize you as a valid user and allow you to collect the data. OpenWeather is no exception. Before starting the coding section, you'll need to create an account to obtain an API key:
* [OpenWeather Sign Up](https://home.openweathermap.org/users/sign_up)

In [None]:
# Collecting Weather Data from OpenWeather
api_key = "YOUR_API_KEY" # Replace with your API key

# For this example, we are only going to retrieve today's weather data for the city of San Diego
city = 'San Diego' # You can test with a different city
country = 'US'

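url = 'https://api.openweathermap.org/data/2.5/weather'
params = {'q': f'{city},{country}', 'appid': api_key}

# Send a GET request; requests URL-encodes the query parameters (e.g. the space in the city name)
response = requests.get(url, params=params)

if response.status_code == 200: # We verify that our request was successful
    data = response.json()
else:
    data = None # Avoid showing stale data left over from a previous cell
    print(response.status_code) # In case our request was not successful, we print the status code
data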

As you can see, we collected our data in JSON format. However, this data is hard to work with in the current format. Let's do some transformations to use the data we need in a pandas dataframe object.

In [None]:
# We are going to keep the name of the city, country, temperature, humidity and the description of the weather 
weather_data = {
        'City': data['name'],
        'Country': data['sys']['country'],
        'Temperature (Celsius)': data['main']['temp'] - 273.15,  # Data comes in Kelvin by default. We convert to Celsius.
        'Humidity (%)': data['main']['humidity'],
        'Description': data['weather'][0]['description']
    }

# Create a Pandas DataFrame
df = pd.DataFrame([weather_data])
df

In this example, we only collected data for a single city. However, you can think of a case in which you could retrieve data from a lot of cities and multiple dates to build a huge dataset.
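The multi-city idea could be sketched as follows. This is not the only way to do it: the helper names `fetch_weather`, `to_record` and `build_weather_table` are hypothetical, and the `get` argument exists only so the function can be tried with a stand-in instead of a real network call. The endpoint and the `q` and `appid` parameters are the same ones used in the single-city example.

```python
import pandas as pd
import requests

BASE_URL = "https://api.openweathermap.org/data/2.5/weather"


def fetch_weather(city, country, api_key, get=requests.get):
    """Return the parsed JSON payload for one city, or None if the request fails."""
    params = {"q": f"{city},{country}", "appid": api_key}
    try:
        response = get(BASE_URL, params=params, timeout=10)
        response.raise_for_status()  # Raise an error for 4xx/5xx status codes
    except requests.RequestException as error:
        print(f"Request for {city} failed: {error}")
        return None
    return response.json()


def to_record(payload):
    """Keep the same five fields as in the single-city example."""
    return {
        "City": payload["name"],
        "Country": payload["sys"]["country"],
        "Temperature (Celsius)": payload["main"]["temp"] - 273.15,  # Kelvin to Celsius
        "Humidity (%)": payload["main"]["humidity"],
        "Description": payload["weather"][0]["description"],
    }


def build_weather_table(locations, api_key, get=requests.get):
    """Fetch every (city, country) pair and stack the successful results."""
    payloads = (
        fetch_weather(city, country, api_key, get=get)
        for city, country in locations
    )
    return pd.DataFrame([to_record(p) for p in payloads if p is not None])
```

With a valid key, a call such as `build_weather_table([("San Diego", "US"), ("Berlin", "DE")], api_key)` would return one row per city that responded successfully. Cities whose request fails are skipped rather than stopping the whole collection.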

## Web Scraping

Another way to collect data is through web scraping, which refers to the technique of collecting data directly from websites. A common Python library for this task is [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

Imagine, for example, that you wanted to analyze the sentiment and narrative of some of the most important news websites through the titles of their articles. In the following code, we are going to collect all the article titles from the BBC News main page.

In [None]:
# Web scraping job to collect the titles of the news.
url = 'https://www.bbc.com/news'
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    article_titles = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

    for title in article_titles:
        print(title.text.strip())
else:
    print(response.status_code)

Let's briefly recap what we did in the previous code: 
1. We defined the url of the page to parse and requested that page, receiving a response object.
2. We created our soup object by parsing the page's HTML with Python's built-in html.parser.
3. We asked the soup object to find all the heading tags (h1 through h6) and received them as a list.
4. We printed the text of each of those objects.
   
You can think of more extended versions of the previous exercise, such as retrieving the full articles or collecting data from multiple news sites.
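As a small step in that direction, the scraped headings can be stored in a dataframe instead of being printed. The sketch below parses a hypothetical HTML snippet, so it runs without a network connection. A real news page will have a different and more complex structure.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded news page
sample_html = """
<html><body>
  <h1>Top stories</h1>
  <h2> Markets rally after rate decision </h2>
  <p>Body text that is not a headline.</p>
  <h3>Local team wins final</h3>
</body></html>
"""

demo_soup = BeautifulSoup(sample_html, "html.parser")
headings = demo_soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])

# Keep the tag name next to the cleaned text of each heading
titles_df = pd.DataFrame(
    {
        "Tag": [heading.name for heading in headings],
        "Title": [heading.get_text(strip=True) for heading in headings],
    }
)
titles_df
```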