This notebook is copyright &copy; <a href="https://ajaytech.co"> Ajay Tech</a>. You can find an online version at <a href="https://ajaytech.co/python-machine-learning-lifecycle"> Machine Learning Lifecycle</a> or on <a href="https://github.com/ajaytech002"> Ajay Tech's GitHub page</a>.

# Machine Learning Lifecycle

## Contents

- What is Machine Learning lifecycle
- Data Ingestion
  - Data Import
  - Feature Extraction
  - Data Preprocessing
  - Imputation of missing values
  - Dimensionality Reduction
- Data Modeling<sup>*</sup>
- Deployment<sup>**</sup>

<sup>\*</sup> - _will be dealt with in week 3 and week 4_ <br>
<sup>**</sup> - _will be dealt with on day 19_

### Machine Learning Lifecycle

Just like any project following the software engineering process, Machine Learning also has a lifecycle. Since Machine Learning is more data oriented, the bulk of the time is spent with data. At a high level, the machine learning lifecycle looks something like this.


<img src="./pics/machine-learning-lifecycle-high-level.png"/>

We are not talking about some of the much higher level project activities like

- Project Objectives
- Staffing
- Risk Management etc

Those will be talked about in the context of pure _Project Management_. In this section, we will be talking about the activities that you would have to be part of as either a **Machine Learning Engineer** or __Project lead__. 

If you are wondering why the boxes are not equal in size, it is to signify the amount of time you will spend on each of these activities. As you can see, the bulk of the work is centered around the Data Ingestion process, and that will be the focus of this section. Modeling is what the rest of this course will focus on. Deployment covers how the actual Machine Learning solution is deployed in a live environment and how the results are distributed to the users.

### Data Ingestion

This is where you will be spending most of your time as an ML engineer. Data is messy - there are so many things to be done, like finding the right data sources, cleansing, deduplication, validation etc. These are pretty broad topics that require a variety of skills like SQL, data pre-processing techniques, good Excel skills and so on. We will not be discussing all of the steps in data ingestion. We will only be focusing on the activities highlighted in bold below, specifically in the context of NumPy, Pandas & Scikit Learn.

- Data Import
  - **Excel files**
  - **Flat files**
  - **Web Scraping**
  - **API**
  - Databases
- Feature Extraction
- Data Preprocessing
  - **Feature Scaling**
  - **Non-linear transformations**
  - **Encoding Categorical Features**
- Imputation of missing values
  - **Univariate**
  - **Multivariate**
- Dimensionality Reduction<sup>*</sup>


<sup>*</sup>  _Will be dealt with on day 18_


#### Data Import 

Data import is not a tedious step, but it is typically time consuming. Sourcing the data is not all that straightforward most of the time.

- **Easy** - Sometimes, data is readily available. For example, if you were building a movie recommendation algorithm at Netflix, most of the data would be readily available in their database.
- **Medium** - Data is readily available but in different silos/formats. For example, in the same scenario, imagine you had to get external movie ratings (on top of Netflix's own movie data). This would require some level of data mangling, munging, mixing etc.
- **Hard** - Data is sometimes hard to get using regular methods. You might have to resort to special techniques like data scraping, writing bulk downloaders using APIs etc. In some of these cases, the quality of data might also be questionable.

We will be dealing with some of the simpler methods of importing data.

#### Import Data from Excel files

**Using NumPy**

NumPy does not have functionality to load data directly from Excel (in .xls or .xlsx format). However, you can save the sheet as a CSV in Excel and use the **genfromtxt ( )** function.

In [8]:
import numpy as np

data = np.genfromtxt("./data/iris.csv",delimiter=",",skip_header=1)
data[0:4,:]

array([[5.1, 3.5, 1.4, 0.2, 0. ],
       [4.9, 3. , 1.4, 0.2, 0. ],
       [4.7, 3.2, 1.3, 0.2, 0. ],
       [4.6, 3.1, 1.5, 0.2, 0. ]])
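
As a small illustration of what else **genfromtxt ( )** can do, the sketch below reads the same kind of CSV from an in-memory string and keeps the header instead of skipping it. The string is only a stand-in for a file such as ./data/iris.csv, and the header names are assumed for illustration. Passing `names=True` lets you pick columns by name rather than by position.

```python
import io

import numpy as np

# In-memory stand-in for a CSV file; the header names are assumed for illustration.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
"""

# names=True reads the first line as column names and returns a structured array.
iris = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True)

print(iris.dtype.names)      # column names taken from the header line
print(iris["sepal_length"])  # one column, selected by name
```

The trade-off is that a structured array is one-dimensional (one record per row), so slicing like `data[0:4,:]` no longer applies.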

**Using Pandas**

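To read Excel files, Pandas relies on a separate Python package. Recent versions of Pandas use **openpyxl** for .xlsx files, while **xlrd** (version 2.0 and later) only reads the older .xls format. Once the right package is installed, you can use Pandas' **read_excel ( )** function.

<pre>
> pip install openpyxl
> pip install xlrd
</pre>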

In [1]:
import pandas as pd

data = pd.read_excel("./data/shopping_cart.xlsx")
data.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |


#### Import Data from Flat files

**Using NumPy**

We have already seen how to load data from a CSV file into an array using NumPy's **genfromtxt ( )** function. The same function works with other delimiters, like
- tab ( \t )
- pipe delimited ( | ) etc

In [9]:
import numpy as np

data = np.genfromtxt("./data/iris.txt",delimiter="\t",skip_header=1)
data[0:4,:]

array([[5.1, 3.5, 1.4, 0.2, 0. ],
       [4.9, 3. , 1.4, 0.2, 0. ],
       [4.7, 3.2, 1.3, 0.2, 0. ],
       [4.6, 3.1, 1.5, 0.2, 0. ]])
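
The pipe-delimited case works the same way. Here is a minimal sketch that uses an in-memory string as a stand-in for a flat file; the column names and the choice of three columns are assumptions made for illustration.

```python
import io

import numpy as np

# Pipe-delimited text standing in for a flat file (a few iris-like rows).
pipe_text = """sepal_length|sepal_width|species
5.1|3.5|0
4.9|3.0|0
4.7|3.2|0
"""

# Only the delimiter changes; skip_header=1 drops the header line as before.
data = np.genfromtxt(io.StringIO(pipe_text), delimiter="|", skip_header=1)
print(data.shape)
```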

**Using Pandas**

Pandas has a function (**read_csv**) that loads data with any kind of delimiter, for example
- tab ( \t )
- pipe delimited ( | ) etc

In [10]:
import pandas as pd

data = pd.read_csv("./data/iris.csv")
data.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [12]:
import pandas as pd

data = pd.read_csv("./data/iris.txt",delimiter="\t")
data.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
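
For completeness, the pipe-delimited case might look like the sketch below. It reads from an in-memory string rather than a file, and the column names are assumed for illustration. Unlike **genfromtxt ( )**, **read_csv** keeps the header line as column names by default.

```python
import io

import pandas as pd

# Pipe-delimited text standing in for a flat file (a couple of iris-like rows).
pipe_text = """sepal_length|sepal_width|species
5.1|3.5|0
4.9|3.0|0
"""

# sep (alias: delimiter) tells read_csv how the columns are separated.
df = pd.read_csv(io.StringIO(pipe_text), sep="|")
print(df.dtypes)
```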


#### Import Data using Web Scraping

**Downloading HTML tables using Excel**

Simple HTML tables on the web can be downloaded using Excel's data function. For example, in some of the chapters of this course, I have downloaded population data from <a href="https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)">Wikipedia</a> using Excel.

<img src="./pics/wikipedia-population-data.png"/>

To download it from excel, go to the following menu location.

<img src="./pics/excel-download-from-web.png"/>

Enter the URL and click _Import_.

<img src="./pics/excel-web-query.png"/>

Data is downloaded into excel cells.

<img src="./pics/data-in-excel.png"/>

**Scrape Websites**

Sometimes the only form of data available is on the browser - for example, you are a third party aggregator trying to gather the best promotion on flight tickets from multiple websites. The actual website might not be willing to give you the data straight away. In cases like this, you have to literally scrape the price/discount off of their website. 

Luckily, there are some libraries in Python that can do all the heavy lifting (HTTP requests, parsing, creating deep data structures etc). One such library is <a href="https://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a>. Let's see how to scrape a price off a web page with it.

Install BeautifulSoup version 4
<pre>
> pip install beautifulsoup4
</pre>

Let's find out the price of the iPhone Xs. Go to the Apple website and navigate to the iPhone Xs page. Store the URL of that page in a variable.

In [54]:
url = "https://www.apple.com/shop/buy-iphone/iphone-xs"

Beautiful Soup does not actually go out to the web and get the web page. For that we have to use another Python library called **requests**. It is not part of the standard library, so install it with pip (pip install requests) if you don't already have it. It is a basic HTTP request library that can go out and get content on the web for us. Once we get the actual content of the web page, Beautiful Soup can parse it and present it in a searchable object.

In [55]:
from bs4 import BeautifulSoup
import requests

Get the web page content and give it to Beautiful Soup to parse.

In [56]:
html = requests.get(url).content

soup = BeautifulSoup(html,'html.parser')

Now that we have the content, we have to figure out where exactly the prices are stored. In order to find out the tag where the price is stored, just right click on the web page in the browser and select _View Page Source_. In the page source, search for the price you are looking for. For example, the current price of iphone is 999 dollars. Search the page source with 999. 

<img src="./pics/search-view-source.png"/>

The prices are displayed using a _span_ tag with class *current_price*. Pull out all the elements with the class "current_price". There are multiple ways to do it, but we will just look at one.

In [57]:
soup.select(".current_price")[0]

<span class="current_price">From <b>$549</b></span>

We are just looking at the first match here, and there are many more prices (based on the options selected).
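
To collect every price rather than only the first match, one option is to loop over the result of `select`. The sketch below runs on a small hypothetical HTML snippet modelled on the element shown above. It is not the real page, and the second price is made up.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the single element shown above; not the real page.
html = """
<div>
  <span class="current_price">From <b>$549</b></span>
  <span class="current_price">From <b>$649</b></span>
  <span class="other">ignored</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns every element matching the CSS selector, not just the first.
prices = [tag.b.get_text() for tag in soup.select(".current_price")]
print(prices)
```

On a live page the markup can change at any time, so a scraper like this usually needs checks for missing tags.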

Beautiful Soup is good enough for low volume web scraping. For high volume web scraping (search engine level web scraping), use <a href="https://scrapy.org/">Scrapy</a>.

**API**

API stands for _Application Programming Interface_. It is a way to give programmatic access to a resource. For example, your Alexa device goes out automatically (programmatically) and fetches the weather data for a particular zip code from weather.com. How does it do it?

Weather.com provides an **API** to programmatically fetch weather data. Other examples could be xe.com providing an _API_ for exchange rates or Bloomberg providing an _API_ for stock tickers etc.

In this section, let's use Python to get the weather information for a particular city. In order to avoid abuse and keep track of requests, most of the time an API _Key_ is provided. The example below uses the OpenWeatherMap API (api.openweathermap.org). You can sign up there and a key will be provided to you. Without that key, the API would not honour requests.

<img src="./pics/api-key.png"/>

**APIs** are typically exposed as URLs. For example, to get the weather by a city, use the following API.

<img src="./pics/weather-api-by-city.png"/>

Let's use Python to extract weather for a city in India - say Hyderabad. Don't forget to append the API key using the query parameter _appid_. See how the URL is formed below.

In [58]:
import requests

url = "http://api.openweathermap.org/data/2.5/weather?q=Hyderabad&appid="
key = "37a81ae1e682ac******b0a3727080a6"

url = url + key

html = requests.get(url).content


In [59]:
html

b'{"coord":{"lon":78.47,"lat":17.36},"weather":[{"id":803,"main":"Clouds","description":"broken clouds","icon":"04d"}],"base":"stations","main":{"temp":303.29,"pressure":1008,"humidity":66,"temp_min":302.59,"temp_max":304.15},"visibility":6000,"wind":{"speed":5.7,"deg":250},"clouds":{"all":75},"dt":1561704820,"sys":{"type":1,"id":9214,"message":0.0071,"country":"IN","sunrise":1561680862,"sunset":1561728236},"timezone":19800,"id":1269843,"name":"Hyderabad","cod":200}'
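
String concatenation works for a simple city name like Hyderabad. For values that contain spaces or special characters, one option is to let the standard library assemble the query string. This is a minimal sketch: `build_weather_url` is a hypothetical helper (not part of any library) and the key is a placeholder.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city, api_key):
    """Return the request URL for a city, with the key passed as appid."""
    # urlencode escapes spaces and special characters in the values.
    query = urlencode({"q": city, "appid": api_key})
    return f"{BASE_URL}?{query}"


# "YOUR_API_KEY" is a placeholder, not a working key.
print(build_weather_url("Hyderabad", "YOUR_API_KEY"))
```

The resulting URL can be passed to `requests.get` exactly as in the cell above.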

Incidentally, the API returns data in a specific format called JSON. JSON stands for **JavaScript Object Notation**. Python provides a standard library module called **json** that can parse JSON data for us.

In [61]:
import json

data = json.loads(html)
data

{'coord': {'lon': 78.47, 'lat': 17.36},
 'weather': [{'id': 803,
   'main': 'Clouds',
   'description': 'broken clouds',
   'icon': '04d'}],
 'base': 'stations',
 'main': {'temp': 303.29,
  'pressure': 1008,
  'humidity': 66,
  'temp_min': 302.59,
  'temp_max': 304.15},
 'visibility': 6000,
 'wind': {'speed': 5.7, 'deg': 250},
 'clouds': {'all': 75},
 'dt': 1561704820,
 'sys': {'type': 1,
  'id': 9214,
  'message': 0.0071,
  'country': 'IN',
  'sunrise': 1561680862,
  'sunset': 1561728236},
 'timezone': 19800,
 'id': 1269843,
 'name': 'Hyderabad',
 'cod': 200}

Once the data is parsed, it is an ordinary Python dictionary, and you can extract values by key. For example, to get the city, use

In [70]:
data["name"]

'Hyderabad'

To get the minimum and maximum temperature, use

In [74]:
data["main"]["temp_min"]

302.59

In [75]:
data["main"]["temp_max"]

304.15

Just in case you are wondering why the temperature is so large, it is because the unit of temperature is Kelvin.
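
To read the values in degrees Celsius, subtract 273.15. The sketch below does this on a trimmed copy of the response shown earlier; `kelvin_to_celsius` is a hypothetical helper name used only for illustration.

```python
import json

# A trimmed copy of the response shown above, kept as bytes like requests' .content.
payload = b'{"name": "Hyderabad", "main": {"temp_min": 302.59, "temp_max": 304.15}}'

weather = json.loads(payload)


def kelvin_to_celsius(kelvin):
    """Convert a temperature from Kelvin to degrees Celsius."""
    return kelvin - 273.15


low = round(kelvin_to_celsius(weather["main"]["temp_min"]), 2)
high = round(kelvin_to_celsius(weather["main"]["temp_max"]), 2)
print(weather["name"], low, high)
```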