# Machine Learning
# Part 2: The Basic Work Flow in Python

Together, we will walk through a very basic supervised learning process that covers all the major steps except deployment.

Important: This lecture is prepared for students who are *already* experienced with predictive analytics using R. 

## 1\. Data gathering ~~and wrangling~~

The dataset we use is the **Lending Club** dataset. Refer to file "loan_data_description.pdf" (that you downloaded from Canvas) for details.

Since this is a very basic walkthrough, we won't get into data wrangling today. (We don't even know what to wrangle yet!)

**Important:** Before running the code below, make sure:
+ you have uploaded the data file "loan_data.csv" into folder AMA/02a_ML_in_Python of your online Google Drive,
+ and you have mounted your Google Drive in this Colab session.
  + If not yet, click the "Files" icon on the left, then click "mount Drive".

### Load the LendingClub dataset

In [None]:
# Import the necessary Python packages (we'll discuss these packages later). This is analogous to library() in R.
import numpy as np
import pandas as pd

In [None]:
# Load the Lending Club dataset. Analogous to read.csv() in R.
loan = pd.read_csv('/content/drive/MyDrive/AMA/02a_ML_in_Python/loan_data.csv')

# Show on screen the first few records in this dataset. Analogous to head() in R, but here it is a method of the object "loan".
loan.head(10)

### Data structure for supervised learning in Python

Similar to R, in Python we usually expect data in a ***table format*** for predictive analytics, as shown above.
+ each row is a **sample**/record, e.g. a customer
+ each column is an input variable/attribute/**feature**/predictor, e.g., FICO score of a customer
  + with the exception of one column, which is the output **target**/label/prediction/"dependent variable"

(**Bold font** above indicates terms most frequently used in the Python ML world.)

Two Python packages, **NumPy** and **pandas**, together support our data wrangling needs. We'll delve deeper into them later.



Different from R, in Python we usually expect the inputs and output to be stored separately as follows.



In [None]:
# Separate the dataset into a features matrix X and a target array y
X = loan.drop(columns=['not_fully_paid'])
y = loan['not_fully_paid']

**Features matrix** (a.k.a. the inputs)

The features matrix is often stored in a variable named `X`. The features matrix is assumed to be two-dimensional, with shape `[n_samples, n_features]`, and is most often contained in a NumPy `array` or a pandas `DataFrame`.

**Target array** (a.k.a. the output)

In addition to the features matrix `X`, we also work with a *target array* (also called a *label array*) for supervised learning. By convention we call this array `y`. The target array is usually one-dimensional, with length `n_samples`, and is generally contained in a NumPy `array` or a pandas `Series`. Values in the target array can be either continuous or discrete. The target array is what we want to *predict from the data*, such as whether a customer will default on a loan.
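
To make the two shapes concrete, here is a minimal sketch on a tiny made-up table. The column names and values are illustrative only and are not taken from `loan_data.csv`.

```python
import pandas as pd

# Hypothetical toy data: 4 samples, 2 features, 1 target column
toy_loans = pd.DataFrame({
    "fico": [700, 650, 720, 680],
    "int_rate": [0.11, 0.15, 0.09, 0.13],
    "not_fully_paid": [0, 1, 0, 0],
})

X_toy = toy_loans.drop(columns=["not_fully_paid"])  # features matrix (DataFrame)
y_toy = toy_loans["not_fully_paid"]                 # target array (Series)

print(X_toy.shape)  # (4, 2)  -> [n_samples, n_features]
print(y_toy.shape)  # (4,)    -> length n_samples
```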

### Data wrangling

Is data wrangling important for real-life data? Absolutely yes.

Are we doing data wrangling today? Bravely no :) .

## 2\. Exploratory data analysis (EDA)

In the real world, EDA and related data wrangling will likely take most of your time. For today only, let's keep EDA to a bare minimum.

In [None]:
# How large is the dataset?
loan.shape
# Analogous to dim() in R.

In [None]:
# Any missing data?
loan.isnull().sum()
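
For comparison, here is a small sketch of what `isnull().sum()` reports when data *is* missing. The toy table below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing entry in each column
toy_missing = pd.DataFrame({
    "fico": [700, np.nan, 720],
    "purpose": ["credit_card", None, "educational"],
})

# isnull() marks each missing cell as True; sum() counts them per column
missing_per_column = toy_missing.isnull().sum()
print(missing_per_column)
```

A column with no missing values shows a count of 0.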

In [None]:
# Summary statistics of each numeric column (here only the mean and standard deviation). Roughly analogous to summary() in R.
loan.describe().loc[['mean','std']]

Professor Geng thinks, bravely, that this dataset looks ready for modeling. Let's assume Geng is right, and move on.

## 3\. Modeling

Most traditional machine learning algorithms (as compared to deep learning) in Python are nicely bundled in the awesome **scikit-learn** package (more on this package later). 
+ If anyone asks me why Python over R, scikit-learn is my top reason
+ In Python coding this package is named `sklearn`

### Partition the data

This is done using the `train_test_split()` function in the `sklearn` package:

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Reserve 20% of the dataset for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)    
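
As a quick sanity check on what the split produces, here is a minimal sketch on made-up data with 10 samples. The arrays are hypothetical and chosen only so that the sizes are easy to verify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Fixing `random_state` makes the shuffle reproducible, so re-running the cell gives the same split.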

### Choose the learning algorithm

The scikit-learn package offers numerous classification and regression learning algorithms, many of which are state-of-the-art choices. You can find them at (https://scikit-learn.org/stable/supervised_learning.html) -- this will be a topic of later weeks.

As the target `not_fully_paid` is discrete: 1 for not fully paid, and 0 for fully paid, the Lending Club problem is a **classification** problem.
+ Unlike R, there's no built-in "factor" data type in Python (although pandas does offer a `category` dtype).

Let's try *logistic regression*, which we learned last semester.
+ Analogous to glm(..., family="binomial") or train(..., method="glm", family="binomial") in R.

(NOT required for today) scikit-learn is famous for providing high-quality documentation. For example, for logistic regression:
+ You can find detailed explanation of the underlying statistical concepts at (https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) 
+ You can find detailed coding definitions and examples at (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)


In [None]:
from sklearn.linear_model import LogisticRegression

Since logistic regression cannot handle string data, let us drop the column 'purpose' (but is this the proper way to handle it?):

In [None]:
X_train = X_train.drop(columns=['purpose'])
X_test = X_test.drop(columns=['purpose'])
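
One common alternative to dropping a string column is one-hot encoding, sketched below with `pd.get_dummies()` on a made-up table. This is an illustration of the idea, not necessarily the approach this course will take; the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one string-valued column
toy_features = pd.DataFrame({
    "fico": [700, 650, 720],
    "purpose": ["credit_card", "educational", "credit_card"],
})

# Replace 'purpose' with one 0/1 indicator column per category
toy_encoded = pd.get_dummies(toy_features, columns=["purpose"], dtype=int)

print(toy_encoded.columns.tolist())
# ['fico', 'purpose_credit_card', 'purpose_educational']
```

If the encoding is done separately on training and testing data, the two can end up with different columns when a category appears in only one of them, so encoding before the split (or using scikit-learn's `OneHotEncoder`) is usually safer.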

### Choose model hyperparameters

Most learning algorithms, including `LogisticRegression` implemented in `sklearn`, have many parameters that need to be set before training (scikit-learn supplies default values for any we leave unspecified). They are often referred to as **hyperparameters**. Intuitively, our choice of these hyperparameters controls how the learning works.

Example: The first hyperparameter of `LogisticRegression` is `penalty`. Our choice of its value will affect how complicated the trained algorithm is -- for example, how many features are eventually selected.

For today, we'll ignore the topic of how to choose model hyperparameters, and blindly follow Geng's choice below: 

In [None]:
model = LogisticRegression(random_state=0)
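
To see a hyperparameter in action, the sketch below varies `C`, the inverse regularization strength of `LogisticRegression`. The data are synthetic (generated with `make_classification`) and the two `C` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used here only for illustration
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

# Hyperparameters left unspecified fall back to scikit-learn's defaults
default_C = LogisticRegression().get_params()["C"]
print(default_C)  # 1.0

# Smaller C means stronger regularization, which shrinks the coefficients
strong_reg = LogisticRegression(C=0.01, random_state=0).fit(X_syn, y_syn)
weak_reg = LogisticRegression(C=100.0, random_state=0).fit(X_syn, y_syn)

strong_norm = np.linalg.norm(strong_reg.coef_)
weak_norm = np.linalg.norm(weak_reg.coef_)
print(strong_norm, weak_norm)
```

The first norm should come out smaller than the second, which is one way a hyperparameter controls how complicated the trained model is.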

### Fit your model (a.k.a. train your model)

Now let's fit/train our model; that is, we plug our Lending Club dataset into the chosen learning algorithm and, hopefully, get a trained model.

In [None]:
model.fit(X_train,y_train)

In [None]:
# Run this code cell to see the coefficients of the trained model:
logit_reg_coef = pd.DataFrame(model.coef_[0],index=X_train.columns,columns=['Coef'])
logit_reg_coef

Comment: In glm() in R, specifying the learning algorithm, specifying the hyperparameters, and training are all done in a single line of code. In scikit-learn, however, these are three separate steps. 

One weakness of the scikit-learn package, as compared to R packages, is that it focuses more on prediction and less on complete statistical reporting. For example, `LogisticRegression` does not report p-values. If you need them, try another package, `statsmodels`, as follows:
```
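import statsmodels.api as sm
# sm.Logit does not add an intercept automatically, so add a constant column first
logit_model = sm.Logit(y_train, sm.add_constant(X_train))
result = logit_model.fit()
print(result.summary())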
```

## 4\. Evaluation

Now it's time to see how well our trained algorithm performs. First, we apply the trained model to the testing data `X_test` to get predictions. This is done using the `predict()` method.

In [None]:
y_predict = model.predict(X_test)
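
`predict()` returns hard 0/1 labels. scikit-learn classifiers such as `LogisticRegression` also offer `predict_proba()`, which returns class probabilities. The sketch below, on synthetic data rather than the Lending Club data, shows how the two relate for a binary problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used here only for illustration
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
demo_model = LogisticRegression(random_state=0).fit(X_syn, y_syn)

labels = demo_model.predict(X_syn[:5])       # hard 0/1 predictions
probs = demo_model.predict_proba(X_syn[:5])  # columns: P(class 0), P(class 1)

# For a binary problem, predict() matches thresholding P(class 1) at 0.5
thresholded = (probs[:, 1] > 0.5).astype(int)
print(labels)
print(thresholded)
```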

Second, we compare the predicted values in `y_predict` with the true values we already have in `y_test` using the `accuracy_score()` function:

In [None]:
from sklearn.metrics import accuracy_score
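# accuracy_score() may return a plain Python float, which has no .round() method, so use the built-in round()
round(accuracy_score(y_test, y_predict), 4)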

Does the accuracy of our trained model, as shown above, appear okay to you?

Hint: try `1-y_test.mean()`, and think.

In [None]:
from sklearn.metrics import confusion_matrix
print("The confusion matrix is:")
cm = confusion_matrix(y_test, y_predict)
print(cm)
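
To help read these numbers, here is a small sketch with made-up labels. It compares the accuracy of a "model" that always predicts the majority class with `1 - y.mean()`, and unpacks the four cells of the confusion matrix. All values are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 8 fully paid (0), 2 not fully paid (1)
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A "model" that always predicts the majority class
y_pred_demo = np.zeros(10, dtype=int)

demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(demo_accuracy)            # 0.8
print(1 - y_true_demo.mean())   # 0.8

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(tn, fp, fn, tp)  # 8 0 2 0
```

An accuracy of 0.8 looks respectable here, yet this "model" never catches a single loan that was not fully paid (`tp` is 0).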

**Chances are, we will not stumble upon an ideal trained model on our first try! Or the first many tries!** This will be a repetitive process of observing the results, asking why, forming ideas about what to try next, going back to data wrangling, EDA, and model adjustment accordingly, and seeing whether our ideas help. This will occupy a big chunk of our time in the next 3 weeks.

**So, what should we do next regarding this Lending Club analytics problem?**

## 5\. Deployment

Once we have a champion model that we are happy with, we move it to production. This typically involves converting notebook files to script files (i.e., from an .ipynb file to a .py file), setting up proper input/output pipes, and automating tasks (e.g., running it at 8 am every day automatically). (Not required) You can start from this nice [guide](https://medium.com/@thabo_65610/three-ways-to-automate-python-via-jupyter-notebook-d14aaa78de9).
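
One piece of moving a model to production is getting the trained model out of the notebook and into a script. A common approach (an assumption here, not something prescribed in this lecture) is to save the fitted model to disk with `joblib` and load it back later. The sketch below uses synthetic data and a placeholder file name.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data and a stand-in "champion" model, for illustration only
X_syn, y_syn = make_classification(n_samples=100, n_features=4, random_state=0)
trained = LogisticRegression(random_state=0).fit(X_syn, y_syn)

# Save the trained model to disk (the file name is a placeholder)
model_path = os.path.join(tempfile.mkdtemp(), "champion_model.joblib")
joblib.dump(trained, model_path)

# Later, e.g. in the production script: load it back and predict
restored = joblib.load(model_path)
same_predictions = bool((restored.predict(X_syn) == trained.predict(X_syn)).all())
print(same_predictions)  # True
```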

## 6\. Python packages used in this basic flow

### scikit-learn

scikit-learn (https://scikit-learn.org/) is one of the best-known Python libraries for traditional (a.k.a. non-deep) machine learning. Advantages of scikit-learn include:

- A [large selection](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) of machine learning algorithms
  - efficient implementation (a.k.a. fast) via NumPy and SciPy
- A selection of metrics for measuring model performance
- Excellent online documentation with awesome examples
- A clean, uniform, and streamlined API. Once you learn how to code with one algorithm, switching to other algorithms is usually straightforward
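
The uniform API mentioned in the last bullet can be sketched as follows: two different algorithms are trained and scored with exactly the same method calls. The data are synthetic, and `DecisionTreeClassifier` is picked only as an example of a second algorithm.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only for illustration
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=1)

# Only the constructor differs; fit() and score() are called identically
scores = {}
for algo in (LogisticRegression(random_state=0), DecisionTreeClassifier(random_state=0)):
    algo.fit(X_tr, y_tr)
    scores[type(algo).__name__] = algo.score(X_te, y_te)

print(scores)
```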

**We will study scikit-learn in two weeks** and use it to build, assess and improve our trained models.

### pandas

Effective machine learning depends on quality data input, which usually doesn't come naturally and requires heavy data wrangling. The pandas package in Python offers a powerful set of data structures and tools for us to manipulate data in a table format. In particular, pandas offers two core data structures. First is **DataFrame**, for example:

In [None]:
loan

Second is **Series**, for example:

In [None]:
loan['fico']

In plain English, a DataFrame is a table of data, and a Series is one column of it. 
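
A small sketch of the distinction, using a made-up two-row table: single brackets pull out a Series, while double brackets return a one-column DataFrame.

```python
import pandas as pd

# Hypothetical toy data
toy_table = pd.DataFrame({"fico": [700, 650], "int_rate": [0.11, 0.15]})

one_column = toy_table["fico"]     # single brackets -> Series
sub_table = toy_table[["fico"]]    # double brackets -> one-column DataFrame

print(type(one_column).__name__, one_column.shape)  # Series (2,)
print(type(sub_table).__name__, sub_table.shape)    # DataFrame (2, 1)
```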

Furthermore, both DataFrame and Series are **indexed**. The index can be numerical as above, or string based as below.

In [None]:
df = pd.read_csv('/content/drive/MyDrive/AMA/02a_ML_in_Python/students.csv', index_col='Name')
df
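
With a string-based index, rows can be looked up by label as well as by position. The sketch below builds a tiny stand-in table in memory; the names and scores are made up and are not the contents of `students.csv`.

```python
import pandas as pd

# Hypothetical stand-in for a table indexed by student name
students_demo = pd.DataFrame(
    {"Name": ["Ann", "Bo"], "Score": [90, 85]}
).set_index("Name")

by_label = students_demo.loc["Bo", "Score"]     # label-based lookup
by_position = students_demo.iloc[0]["Score"]    # position-based lookup

print(by_label)     # 85
print(by_position)  # 90
```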

The indexed, table-shaped data structures offered by pandas make data wrangling highly effective and often easier than in other languages, which is why pandas is so popular. **We will study pandas next week** and use it to wrangle data for our Lending Club case.

### NumPy

pandas (and often scikit-learn) in turn depend on NumPy, a package that focuses on one thing: *array computing* that is fast and easy. **Next, we study the NumPy package today**.
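
As a first taste of array computing, here is a minimal sketch with made-up numbers: arithmetic and comparisons apply to every element of a NumPy array at once, with no explicit loop.

```python
import numpy as np

fico_demo = np.array([700, 650, 720, 680])  # made-up values

# Standardize all elements in one expression
scaled = (fico_demo - fico_demo.mean()) / fico_demo.std()

print(fico_demo.mean())   # 687.5
print(fico_demo > 690)    # [ True False  True False]
```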