# A taste of data science
and why we start from NumPy, pandas, and matplotlib...

Let's import some required packages first.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Get data
One of the most convenient  
and unrealistic ways to read data  
is through the built-in functions of a package.

For example, you can do 
```Python
from sklearn.datasets import <tab>
```
Here `<tab>` means to press the tab key there.
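
Where tab completion is not available, a similar list can be produced in code. The following is only a minimal sketch: filtering on names that start with `load_` is an assumption about how the small bundled datasets are named, and the variable `loaders` is purely illustrative.

```python
import sklearn.datasets

# names in sklearn.datasets that start with "load_" (assumed naming convention)
loaders = sorted(name for name in dir(sklearn.datasets) if name.startswith("load_"))
print(loaders)
```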

In [None]:
from sklearn.datasets import load_iris
iris = load_iris() ### load the dataset and store it in the variable `iris`

Now what?

Different datasets have different structures.  
Usually, the people who give you the data need to  
tell you the details.

For data accessed through the `sklearn` (scikit-learn) package,  
each dataset has its own keys.

In [None]:
print(iris.keys())

### The general description is stored in `DESCR`
### and can be accessed by `iris['DESCR']`
print(iris['DESCR'])

In [None]:
X = iris['data']
y = iris['target']
### uncomment the following lines to 
### understand X and y
# print(X)
# print(y)

In a computer, data is usually stored in an array.  
A sample of the data is nothing more than a collection of numbers.

It is important to understand  
how many **samples** (rows) there are, and  
how many **features** (columns) there are.

In [None]:
print(X.shape)
print(y.shape)

So in this case,  
there are 150 samples of iris flowers, and  
for each sample, 4 features are recorded.
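
To see what the four columns mean, the sketch below unpacks the shape and pairs the first row with the feature names stored under the `feature_names` key. The names `n_samples`, `n_features`, and `first_sample` are illustrative only.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris['data']

# each row is one flower, each column is one measured feature
n_samples, n_features = X.shape
print(n_samples, 'samples,', n_features, 'features')

# pair the feature names with the values of the first sample
first_sample = dict(zip(iris['feature_names'], X[0]))
print(first_sample)
```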

`iris['target']` records the species of each sample;  
these are the answers, the **targets**, or the **labels**.

`iris['target_names']` tells you the meaning of each target.

In [None]:
iris['target_names']
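
Since the targets are integers, they can be used directly as indices into the array of names. The sketch below translates every label into its species name and counts the samples in each class; `species` and `counts` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
y = iris['target']

# integer labels index directly into the array of species names
species = iris['target_names'][y]
print(species[:3])

# number of samples in each class
counts = np.bincount(y)
print(dict(zip(iris['target_names'], counts)))
```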

As you have seen, 
the structure of the data is important.

The real-world data is unlikely to be as clean as this.

You might be dealing with:  
- pictures of different resolutions (input dimensions)
- pictures with noise (inevitable)
- text with redundant information or redundant formatting (e.g., collecting the gender data by Word)
- and so on.

Processing and cleaning the data is important  
and has lots of dirty work involved.  

However, a good project requires  
smooth cooperation between several reliable pieces of work,  
such as 
- collecting the data,
- cleaning the data, 
- analyzing the data, 
- visualizing the data and selling the results.

Each step is important, while data analysis  
accounts for only a small proportion.

## Support-vector machine

Find a cutting line (hyperplane) between  
data points from different categories.

![Illustration of support-vector machine](256px-SVM_margin.png "Illustration of support-vector machine")
(Source: Wikipedia &mdash; Support-vector machine)  
(Larhmam [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0))
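
To make the idea of a cutting line concrete, here is a minimal sketch under simplifying assumptions: only the first two classes and the first two features are kept, and a linear kernel is chosen so that the line can be read off from the fitted model. With a linear kernel, `coef_` and `intercept_` hold the vector `w` and the offset `b` of the line `w @ x + b = 0`. The names `X2`, `y2`, and `clf` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
mask = iris['target'] < 2            # keep two classes only (simplifying assumption)
X2 = iris['data'][mask][:, :2]       # keep two features so the hyperplane is a line
y2 = iris['target'][mask]

clf = SVC(kernel='linear')           # linear kernel, so coef_ is available
clf.fit(X2, y2)

w = clf.coef_[0]                     # normal vector of the line
b = clf.intercept_[0]                # offset of the line
print('w =', w, 'b =', b)

# the sign of w @ x + b tells on which side of the line a point lies
scores = X2 @ w + b
print(scores[:3])
```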

Let's have a glance at the data points.

Think about what `X[:,0]` and `X[:,1]` are.  

You may also try combinations other than 0 and 1.  
For example, you may try 1 and 2.

In [None]:
plt.scatter(X[:,0], X[:,1], c=y, cmap='viridis')
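
When trying other combinations, labelling the axes helps keep track of which features are being compared. This is a small sketch; the indices `i` and `j` are arbitrary choices and can be changed freely.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris['data'], iris['target']

i, j = 1, 2  # column indices of the two features to compare
fig, ax = plt.subplots()
ax.scatter(X[:, i], X[:, j], c=y, cmap='viridis')
ax.set_xlabel(iris['feature_names'][i])
ax.set_ylabel(iris['feature_names'][j])
```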

Let's apply SVM anyway  
(without understanding what is going on).

In [None]:
from sklearn.svm import SVC
model = SVC() ### create a support-vector classifier
model.fit(X, y) ### adjust the parameters of the model to fit the data
ymodel = model.predict(X) ### parameters are fixed now, use them to predict the answer

In [None]:
### predicted answers
ymodel

In [None]:
### answers given by human
y

How many are correct?  

Stop counting, use `accuracy_score`.

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y, ymodel) ### arguments are (true labels, predicted labels)
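
The accuracy is simply the fraction of predictions that agree with the given answers. As a sketch of that computation, the count can be done by comparing the two arrays elementwise; `n_correct` and `manual_accuracy` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

iris = load_iris()
X, y = iris['data'], iris['target']
ymodel = SVC().fit(X, y).predict(X)

# compare predictions with the answers elementwise and count the matches
n_correct = np.sum(ymodel == y)
manual_accuracy = n_correct / len(y)
print(n_correct, 'out of', len(y), 'correct')
print(manual_accuracy, accuracy_score(y, ymodel))
```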

However, this does not make sense,  
since we used the training data  
to test the accuracy:  
a model can score well on data it has already seen  
without doing well on new data.

Let's separate the given data samples  
into **training data** and **test data**.

In [None]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

### see the shape of each set
for name, arr in [('Xtrain', Xtrain), ('Xtest', Xtest), ('ytrain', ytrain), ('ytest', ytest)]:
    print(name, 'shape', arr.shape)
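
The split is random, so the result changes from run to run. The sketch below shows three optional arguments of `train_test_split` that control this: `test_size`, `random_state`, and `stratify`. The specific values and the names `Xtr`, `Xte`, `ytr`, `yte` are arbitrary choices for illustration, not settings used elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris['data'], iris['target']

Xtr, Xte, ytr, yte = train_test_split(
    X, y,
    test_size=0.2,     # hold out 20% of the samples for testing
    random_state=0,    # fixed seed for reproducibility; the value 0 is arbitrary
    stratify=y,        # keep the class proportions the same in both parts
)
print(Xtr.shape, Xte.shape)
print(np.bincount(ytr), np.bincount(yte))
```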

Let's do it again.

In [None]:
model = SVC()
model.fit(Xtrain, ytrain) ### use training data sets only
ymodel = model.predict(Xtest) ### find the prediction of the test data
accuracy_score(ytest, ymodel) ### compute the accuracy for test data

## Neural Network

How much do you know about a neural network  
beyond this graph?

![Neural network](256px-Colored_neural_network.svg.png 'Neural network')
(Source: Wikipedia &mdash; Artificial Neural Network)  
(Glosser.ca [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0))

Use the `keras` package  
to set up a neural network with layers of sizes  
4, 30, 10, and 3.  

As you can see, there are lots of arguments  
that you have to set up.

- [Keras Documentation &mdash; Activations](https://keras.io/api/layers/activations/)
- [Keras Documentation &mdash; Optimizer](https://keras.io/api/optimizers/)
- [Keras Documentation &mdash; Losses](https://keras.io/api/losses/)
- [Keras Documentation &mdash; Metrics](https://keras.io/api/metrics/)

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
model.add(keras.Input(shape=(4,)))
model.add(layers.Dense(30, activation="tanh"))
model.add(layers.Dense(10, activation="tanh"))
model.add(layers.Dense(3, activation='softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
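
To get a feeling for what the layers compute, here is a rough NumPy sketch of a forward pass through a network of the same shape. It is not how Keras is implemented: the weights are random rather than trained, and all names (`sizes`, `weights`, `biases`, `forward`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 30, 10, 3]

# one weight matrix and one bias vector per pair of consecutive layers
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Forward pass: tanh on the hidden layers, softmax on the output layer."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.tanh(x @ W + b)
    z = x @ weights[-1] + biases[-1]
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# number of trainable parameters: (4*30+30) + (30*10+10) + (10*3+3)
n_params = sum(W.size + b.size for W, b in zip(weights, biases))
print(n_params)

probs = forward(rng.normal(size=(5, 4)))  # 5 made-up samples with 4 features
print(probs.shape, probs.sum(axis=1))     # each row of probabilities sums to 1
```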

You get an error when  
you try to fit the model to the training data.

In [None]:
model.fit(Xtrain, ytrain)

By default, the `sklearn` package  
expects the targets to be integers.  
```Python
[1,
1,
3,
2]
```

However, it is common for a neural network  
to use the one-hot encoding.
```Python
[[1,0,0],
 [1,0,0],
 [0,0,1],
 [0,1,0]]
```
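
The encoding itself is simple. As a minimal sketch in plain NumPy, assuming zero-based labels like the ones in the iris data (0, 1, 2), each label picks one row of an identity matrix. The names `labels`, `onehot`, and `recovered` are illustrative.

```python
import numpy as np

labels = np.array([0, 0, 2, 1])          # zero-based class labels
onehot = np.eye(3, dtype=int)[labels]    # row k of the identity matrix encodes class k
print(onehot)

# argmax along each row converts the encoding back to integer labels
recovered = onehot.argmax(axis=1)
print(recovered)
```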

In [None]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()

### create a new training target by one-hot encoding
ytrain_onehot = enc.fit_transform(ytrain[:,np.newaxis]).toarray()

### encode the test target with the encoder already fitted on the training target
ytest_onehot = enc.transform(ytest[:,np.newaxis]).toarray()

In [None]:
### Keras sets up some of its configuration when it first sees the data,
### so the previous model was contaminated by the failed fit; build a fresh one
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
model.add(keras.Input(shape=(4,)))
model.add(layers.Dense(30, activation="tanh"))
model.add(layers.Dense(10, activation="tanh"))
model.add(layers.Dense(3, activation='softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
model.fit(Xtrain, ytrain_onehot, epochs=20, validation_data=(Xtest, ytest_onehot))

A complicated model, e.g., a neural network,  
can possibly get a better result,  
but it can also potentially get stuck.  

The strength of a neural network is its versatility.  
It can deal with various problems,  
but the outcomes from a neural network  
are usually not so interpretable.

`model.evaluate` by default gives you  
the value of the loss function and  
the accuracy.

In [None]:
model.evaluate(Xtest,ytest_onehot)
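
To read the two numbers, it helps to compute them once by hand. The sketch below uses made-up softmax outputs and made-up one-hot targets, not results from the model above, and it ignores small numerical safeguards that Keras applies internally.

```python
import numpy as np

# made-up softmax outputs for 4 samples and 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.2, 0.6]])
# made-up one-hot targets for the same 4 samples
targets = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1],
                    [0, 0, 1]])

# categorical cross-entropy: minus the log of the probability given to the
# true class, averaged over the samples
loss = -np.mean(np.sum(targets * np.log(probs), axis=1))

# accuracy: fraction of samples whose most probable class is the true class
accuracy = np.mean(probs.argmax(axis=1) == targets.argmax(axis=1))
print(loss, accuracy)
```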

## Conclusion
There are various tools available,  
and they have different strengths.  

In order to make everything work well,  
the structure of your data has to  
meet the model's design;  
also, selecting appropriate model arguments (aka **hyperparameters**)  
is a key factor in reaching high performance.

Therefore,  
- **understand your data structure**,  
- **learn how to process and manipulate the data**, and  
- **know what you are doing with each model**.
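
As a small illustration of the point about hyperparameters, the sketch below fits the support-vector classifier with a few values of its regularization argument `C` and records the test accuracy for each. The values of `C`, the seed, and the names `Xtr`, `Xte`, `ytr`, `yte`, `scores` are arbitrary choices for illustration; the numbers depend on the split.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
Xtr, Xte, ytr, yte = train_test_split(iris['data'], iris['target'], random_state=0)

scores = {}
for C in [0.01, 1, 100]:                 # arbitrary values to compare
    clf = SVC(C=C).fit(Xtr, ytr)         # same model, different hyperparameter
    scores[C] = accuracy_score(yte, clf.predict(Xte))
print(scores)
```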