<a href="https://colab.research.google.com/github/lindajune/handson-2021-code-testing/blob/main/04_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment One (debug the code until it runs to completion)

---
This assignment covers the basic steps of running a regression analysis with Python. The code below contains errors; test it, identify the errors and/or exceptions, and fix them until it runs to completion. This should help you become familiar with Python coding.

## 1. Simple Linear regression with scikit-learn
- about [scikit-learn](https://scikit-learn.org/stable/getting_started.html): it is probably the most useful library for machine learning in Python. The **sklearn** library contains a lot of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction.

- when successfully run: the intercept should be -0.572 and the regression coefficient should be 2.019

In [None]:
## create data
import matplotlib.pyplot as plt
rng = np.random.RandomState(20)
x = 10*rng.rand(50)
y = 2*x-1+rng.rand(50)

## define the linear regression model
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
model

## model fitting
model.fit(x,y);

## display the parameters
print('The coefficient is ' model.coef_)
print('The intercept is ' model.intercept_)

## model plot
xfit = np.linspace(-1,11)
xfit = xfit[:,np.newaxis]
yfit = model.predict(xfit)
plt.scatter(x,y)
plt.plot(xfit,yfit)

## 2. Multivariable linear regression with statsmodels

- about [statsmodels](https://www.statsmodels.org/stable/index.html): it is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

- when successfully run, the code should:
  1.   display the sample size (85, 4)
  2.   show the descriptive statistics for the continuous variables "Lottery", "Literacy", and "Wealth", and the counts for the categorical variable "Region", as:
![output 1](https://drive.google.com/uc?export=view&id=1GsrOhd28zEcTpVpCymIrGjrXgbBOQdOw)

  3. show the regression summary as:

![output 2](https://drive.google.com/uc?export=view&id=1Q6wDT8xzuM79iFlss2KtXJzSLYXcJKVk)

In [None]:
## load modules and functions
import numpy as np
import statsmodels.api as sm

## get data
ds = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
df = ds.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head()

## check the sample size
df.shape()
print('\n')

## statistical summary
print(df.describe)
print('\n')
print('--> counts for each Region:')
print(df('Region').value_counts())

In [None]:
## fit the model
mod = ols(formula='Lottery ~ Literacy + Wealth + Region')
res = mod.fit();
print(res.summary)

## Could you ...
write a function that runs multiple regression analyses in a loop?

i.e.,

    analysis 1: formula='Lottery ~ Literacy + Region'
    analysis 2: formula='Lottery ~ Wealth + Region'


In [None]:
def 


for 



# Assignment Two (practice to minimize careless errors)

---
This assignment covers the basic steps of machine learning with Python. Following the guidelines below, modify and test the code with the tools you prefer until it runs correctly.

1. Load modules and functions up front




In [None]:
## load modules and functions

2. Define variables to avoid data misuse (e.g., the data path)

In [None]:
## Load the dataset
from pandas import read_csv
dataset = read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv", names=['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class'])

3. Inspect the data before analysis

In [None]:
## Inspect the data
# Peek at the data
print ('<====== Display the top 5 samples ======>')
print(dataset.head(5))

# Dimensions of Dataset
print ('\n')
print ('<======   Dimensions of Dataset   ======>')
print(dataset.shape)
input("How many variables in this dataset: ")
input("How many samples in this dataset: ")

# statistical summary
print ('\n')
print ('<======   Descriptive Statistics   ======>')
print(dataset.describe())
# class distribution
print(dataset.groupby('class').size())

# data visualization: plot the histogram of continuous variables
print ('\n')
print ('<===========     Histogram     ===========>')
import matplotlib.pyplot as plt
# histograms
dataset.hist()
plt.show()

4. Define the training & testing data before modeling
- about [cross validation](https://machinelearningmastery.com/introduction-to-random-number-generators-for-machine-learning/): it is a resampling procedure used to evaluate machine learning models on a limited data sample. Usually, we split the data into a training set and a test set. You may have heard of leave-one-out, 5-fold, or 10-fold cross validation. These names all refer to the number of groups (folds) that a given data sample is split into; in leave-one-out, each fold holds a single sample.

    The training set is used to build the model (or classifier), while the test set is unseen data to which the model is applied.

- about [Random Number Generators](https://machinelearningmastery.com/introduction-to-random-number-generators-for-machine-learning/): 
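
To make the two ideas above concrete, here is a minimal sketch in plain Python (no scikit-learn) of how a seeded random number generator can be used to split samples into k folds. The helper name `k_fold_indices` and its parameters are hypothetical and chosen only for illustration; this is not how scikit-learn implements its splitters.

```python
import random

def k_fold_indices(n_samples, k, seed=1):
    """Shuffle sample indices with a seeded generator and split them into k folds."""
    rng = random.Random(seed)      # same seed -> same shuffle on every run
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal indices round-robin so fold sizes differ by at most one
    return [indices[i::k] for i in range(k)]

# 150 samples split into 5 folds (hypothetical sizes for illustration)
folds = k_fold_indices(n_samples=150, k=5, seed=1)

for i, test_idx in enumerate(folds):
    # Each fold serves once as the test set; the other folds form the training set
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    print(f"fold {i}: train={len(train_idx)}, test={len(test_idx)}")
```

Calling the helper twice with the same `seed` returns identical folds, which is the role a fixed seed plays when you want a split to be reproducible. Setting `k` equal to the number of samples gives the leave-one-out case.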


In [None]:
from sklearn.model_selection import train_test_split

# Split-out training and test dataset
array = dataset.values
X = array[:,0:4] # features
y = array[:,4]   # labels

# 4/5 samples are training, 1/5 samples are test
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.20, random_state=1)

- check the sample size of training and test set



5. Compare different algorithms (OPTIONAL)
- This part is for you to explore if you have additional time
- Purpose: 
> build multiple different models to predict the label (i.e., species from flower measurements); select the best model
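
One simple way to "select the best model" is to compare the mean cross-validated accuracy of each candidate. The sketch below assumes that rule and uses made-up per-fold scores; the numbers and the helper `summarize` are hypothetical, not results from the iris data.

```python
from statistics import mean, pstdev

# Hypothetical per-fold accuracies, made up for illustration only
cv_scores = {
    "LR":  [0.90, 0.95, 0.92, 0.94, 0.93],
    "KNN": [0.96, 0.94, 0.97, 0.95, 0.96],
    "SVM": [0.98, 0.97, 0.99, 0.96, 0.98],
}

def summarize(scores_by_model):
    """Return {name: (mean accuracy, population standard deviation)}."""
    return {name: (mean(s), pstdev(s)) for name, s in scores_by_model.items()}

summary = summarize(cv_scores)

# Pick the model whose mean cross-validated accuracy is highest
best_name = max(summary, key=lambda name: summary[name][0])

for name, (avg, sd) in summary.items():
    print(f"{name}: {avg:.3f} ({sd:.3f})")
print("best:", best_name)
```

The standard deviation is reported alongside the mean because a model with a slightly higher mean but a much larger spread across folds may not be the safer choice.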




In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
# check algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVM', SVC(gamma='auto')))
models

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
# evaluate each model in turn
results = []
names = []
for name, model in models:
	kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
	cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
	results.append(cv_results)
	names.append(name)
	print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
 
# Compare Algorithms
plt.boxplot(results, labels=names)
plt.title('Algorithm Comparison')
plt.show()

6. Build model and evaluate with test data: 
- if run correctly, the accuracy is 0.967

In [None]:
# Build model with Training data
model = SVC(gamma='auto')
model.fit(X_train, Y_train)

# Make predictions on Test data
predictions = model.predict(X_train)

# Evaluate predictions
from sklearn.metrics import accuracy_score
print(round(accuracy_score(Y_train, predictions),3))