$\def\*#1{\mathbf{#1}}$
$\DeclareMathOperator*{\argmax}{arg\,max}$
# Data Types

## Imports

In [None]:
import matplotlib as mpl
# pyplot : Provides a MATLAB-like plotting framework
import matplotlib.pyplot as plt
import numpy as np

# %matplotlib notebook
%matplotlib

## Data Matrix

The data set is represented by an $n \times d$ **data matrix**:

$$
D = 
\begin{pmatrix}
  x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\
  x_{2,1} & x_{2,2} & \cdots & x_{2,d} \\
  \vdots  & \vdots  & \ddots & \vdots  \\
  x_{n,1} & x_{n,2} & \cdots & x_{n,d} 
\end{pmatrix}
$$

* The *i*-th **row** refers, depending on the application, to an *entity*, *instance*, **record**, *transaction*, *alternative*,...

$$\*x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$$

* The *j*-th **column** refers to an *attribute*, **feature**, *dimension*, *criterion*,...

$$X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})$$

$$
D = 
\left(
\begin{array}{c|cccc}
        & X_1 & X_2 & \cdots & X_d\\
        \hline
  \*x_1 & x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\
  \*x_2 & x_{2,1} & x_{2,2} & \cdots & x_{2,d} \\
  \vdots & \vdots  & \vdots  & \ddots & \vdots  \\
  \*x_n & x_{n,1} & x_{n,2} & \cdots & x_{n,d} 
\end{array}
\right)
$$
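
As a minimal sketch of this notation, the cell below builds a small data matrix with NumPy and extracts one row and one column. The values are only illustrative, and the names `D`, `x_2` and `X_1` are chosen here to mirror the symbols above. Note that NumPy indices start at 0, so $\*x_2$ is `D[1]` and $X_1$ is `D[:, 0]`.

```python
import numpy as np

# Illustrative 3 x 2 data matrix: n = 3 rows (instances), d = 2 columns (attributes)
D = np.array([[5.1, 3.5],
              [4.8, 3.0],
              [6.0, 3.4]])

n, d = D.shape   # number of rows and number of columns

x_2 = D[1]       # second row: all attribute values of one instance
X_1 = D[:, 0]    # first column: one attribute over all instances

print(n, d)
print(x_2)
print(X_1)
```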

## Iris Data Set

 | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Type of iris plant |
 | ----------------- | ---------------- | ----------------- | ---------------- | ------------------ |
 | 5.1               | 3.5              | 1.4               | 0.2              | Setosa             |
 | 4.8               | 3.0              | 1.4               | 0.3              | Setosa             |
 | 6.0               | 3.4              | 4.5               | 1.6              | Versicolor         |
 | 6.8               | 3.0              | 5.5               | 2.1              | Virginica          |
 | 6.7               | 3.1              | 5.6               | 2.4              | Virginica          |

In [None]:
import numpy as np

filename = '../datasets/iris.data'

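classes = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}

# Depending on the NumPy version, loadtxt passes the field to the converter
# as bytes or as str, so bytes are decoded before the lookup.
classes_converter = {4: lambda c: classes[c.decode() if isinstance(c, bytes) else c]}

data = np.loadtxt(filename, delimiter=',', skiprows=1, converters=classes_converter)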

print(data)

In [None]:
i = 3
xi = data[i]
print(xi)

In [None]:
j = 1
Xj = data[:,j]
print(Xj)

In [None]:
print(data[0:5,:])

In [None]:
Xj = data[0:5,j]
print(Xj)

In [None]:
from pandas import read_csv

df = read_csv('../datasets/iris.data')

df.head()

In [None]:
type(df)

In [None]:
df.shape

In [None]:
df.info()

In [None]:
i = 3
xi = df.iloc[i]
print(xi)

## Attributes

* Numeric attributes

* Categorical attributes

## Numeric (quantitative) Attributes

* `domain(age)` = $\mathbb{N}$
* `domain(petal length)` = $\mathbb{R}_{>0}$
* **discrete** : finite or countably infinite set of values
* **continuous** : any real value

**Measurement scales**

* **Interval scale** :
    * Only addition and subtraction make sense.
    * The *zero point* does not indicate the absence of measurement.
    * The `temperature` measured in $^{\circ}C$ is interval-scaled. If two measurements of $20 ^{\circ}C$ and $10 ^{\circ}C$ are compared, which statement is correct?
        * There is a temperature drop of $10 ^{\circ}C$.
        * The second measure is twice as cold as the first one.
* **Ratio scale**
    * Addition, subtraction, and ratio make sense.
    * The `Age` attribute is ratio-scaled.
    * The `temperature` measured in *Kelvin* is ratio-scaled.
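
The difference between the two scales can be checked numerically. The sketch below reuses the two measurements from the temperature example and converts them to Kelvin with the usual offset of 273.15. The variable names are chosen here for illustration. The difference is the same on both scales, whereas the ratio of 2 obtained in $^{\circ}C$ is an artifact of the arbitrary zero point and does not survive the conversion.

```python
# Two temperature measurements in degrees Celsius (interval scale)
t1_c, t2_c = 20.0, 10.0

# Same measurements in Kelvin (ratio scale): the zero point is absolute zero
t1_k, t2_k = t1_c + 273.15, t2_c + 273.15

# Differences are meaningful on both scales and agree
drop_c = t1_c - t2_c
drop_k = t1_k - t2_k

# Ratios are only meaningful on the ratio scale
ratio_c = t1_c / t2_c   # 2.0, but "twice as warm" is not a valid statement
ratio_k = t1_k / t2_k   # close to 1: the two temperatures differ by about 3.5 %

print(drop_c, round(drop_k, 2))
print(ratio_c, round(ratio_k, 4))
```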

## Categorical (qualitative) Attributes
* A set of symbols, for example : 
    * `domain(Education) = {HighSchool, BS, MS, PhD}`
    * `domain(Fruits) = {Orange, Apple}`

**Measurement scales**

* **Nominal scale** : values are *unordered* 
* **Ordinal scale** : values are *ordered* 
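
One possible way to encode the two scales is `pandas.Categorical` with the `ordered` flag, sketched below on the two domains listed above. The sample values are made up for illustration. Order-based operations such as `min`, `max` and `<` are available for the ordinal attribute only.

```python
import pandas as pd

# Nominal scale: categories have no order
fruits = pd.Categorical(['Orange', 'Apple', 'Orange'],
                        categories=['Orange', 'Apple'], ordered=False)

# Ordinal scale: categories are listed from lowest to highest
education = pd.Categorical(['BS', 'PhD', 'HighSchool', 'MS'],
                           categories=['HighSchool', 'BS', 'MS', 'PhD'],
                           ordered=True)

# Order-based operations are defined for the ordinal attribute
print(education.min(), education.max())
print(education < 'MS')

# They are rejected for the nominal attribute
try:
    fruits.min()
except TypeError:
    print('min() is not defined for an unordered categorical')
```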

## Geometric View

In [None]:
fig, ax = plt.subplots()

ax.set_xlabel('Sepal length')
ax.set_ylabel('Sepal width')

X = data[:,0:4]
Y = data[:,4]

ax.scatter(X[:, 0], X[:, 1], c=Y)

In [None]:
fig, axs = plt.subplots(4, 4)
attributes = ['sepal length', 'sepal width', 'petal length', 'petal width']
for i in range(4):
    axs[i, 0].set_ylabel(attributes[i])
    axs[-1, i].set_xlabel(attributes[i])
    for j in range(4):
        # Row i shows attribute i on the y-axis, column j shows attribute j on the x-axis
        axs[i, j].scatter(X[:, j], X[:, i], c=Y)
plt.tight_layout()

### Data binning

In [None]:
fig, ax = plt.subplots()

n, bins, patches = ax.hist(X[:,0], bins=10, edgecolor='black', linewidth=1)
ax.set_xticks(bins)
ax.set_xlabel(attributes[0])

In [None]:
fig, ax = plt.subplots()
hist = ax.hist([1,1,1,2,2,4,4], bins=3, edgecolor='black', linewidth=1)
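
To see what the histogram computes for the small example, the sketch below calls `np.histogram` on the same values. With `bins=3`, the range $[1, 4]$ is split into three equal-width bins, and every bin is half-open except the last one, which also includes its right edge.

```python
import numpy as np

values = [1, 1, 1, 2, 2, 4, 4]

# Equal-width binning: 3 bins over [min, max] = [1, 4]
counts, edges = np.histogram(values, bins=3)

print(edges)   # bin edges: [1. 2. 3. 4.]
print(counts)  # values per bin [1,2), [2,3), [3,4]: [3 2 2]
```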

## Dependency-oriented data

Relationships between data items :

* **Time-Series** : data generated by continuous measurement over time
    * *environmental sensor* : temperature, pressure
    * *financial market analysis*
* **Discrete Sequences**
    * *event logs* such as web accesses : Client IP, Web page address
    * *strings*
* **Spatial** : non-spatial attributes measured at spatial locations
    * *hurricane forecasting* : sea-surface temperature, pressure
* **Spatiotemporal**
* **Network and Graph Data**
    * *Web graph*
    * *Social networks*
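
A minimal sketch of why such dependencies matter, using a time series as the example. The readings below are hypothetical. Reordering the rows of an ordinary data matrix loses nothing, but reordering a time series keeps order-free statistics such as the mean while destroying the changes between successive measurements.

```python
import numpy as np

# Hypothetical temperature readings taken at regular time steps
temps = np.array([12.0, 12.5, 13.5, 15.0, 14.0])

# Change between successive measurements: depends on the order of the items
changes = np.diff(temps)

# The same values in reverse order
reordered = temps[::-1]

print(temps.mean(), reordered.mean())  # identical means
print(changes)                         # [ 0.5  1.   1.5 -1. ]
print(np.diff(reordered))              # [ 1.  -1.5 -1.  -0.5]
```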



## Text Data

* A **string** : a discrete sequence of characters
* **Vector-space representation** : word (term) frequencies, normalized with respect to the document length
    * **Document-term matrix** : $n$ documents $\times$ $d$ terms

In [None]:
# import scikit-learn : Machine Learning in Python

# See : http://scikit-learn.org/stable/modules/feature_extraction.html
# and http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer : Convert a collection of text documents to a matrix of token counts
vectorizer = CountVectorizer()

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

# Learn the vocabulary dictionary and return the document-term matrix.
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in recent scikit-learn releases
print(vectorizer.get_feature_names_out())
X.toarray()
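
The matrix above contains raw counts. The sketch below shows one way to obtain frequencies normalized by document length, written with the standard library and NumPy only so that each step is visible. The tokenization (lowercasing, words of at least two characters) is chosen to mimic the default settings of `CountVectorizer`, and the names `docs`, `vocabulary`, `counts` and `tf` are introduced here for illustration.

```python
import re
from collections import Counter

import numpy as np

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

# Tokenize: lowercase, keep words of at least two characters
docs = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in corpus]

# Sorted vocabulary and the column index of each term
vocabulary = sorted({term for doc in docs for term in doc})
column = {term: k for k, term in enumerate(vocabulary)}

# Document-term matrix of raw counts: n documents x d terms
counts = np.zeros((len(docs), len(vocabulary)))
for i, doc in enumerate(docs):
    for term, c in Counter(doc).items():
        counts[i, column[term]] = c

# Normalize each row by the document length (its number of tokens)
tf = counts / counts.sum(axis=1, keepdims=True)

print(vocabulary)
print(tf.round(2))
```

Each row of `tf` sums to 1, so documents of different lengths become comparable.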

## Graph Data

A graph $G = (V, E)$ with $n$ ***vertices*** and $m$ ***edges*** consists of:

* $V = V(G)$ : a vertex set; $n = |V|$ is the order of $G$
* $E = E(G)$ : a set of pairs of vertices, called edges; $m = |E|$

A ***weighted graph*** is a graph $G = (V, E)$ in which each edge $e \in E(G)$ is given a numerical weight $w(e)$, where $w : E \rightarrow \mathbb{R}$.

In [None]:
import networkx as nx
    
def draw_weighted_graph(g):
    pos = nx.spectral_layout(g)
    nx.draw_networkx(g, pos)
    edge_labels = {edge[0:2]: edge[2]['weight'] for edge in g.edges(data=True)}
    nx.draw_networkx_edge_labels(g, pos, edge_labels)

g = nx.Graph()
    
g.add_nodes_from(['Lille', 'Paris', 'Amiens', 'Arras'])
g.add_edge('Lille', 'Paris', weight=225)
g.add_edge('Lille', 'Amiens', weight=62.7)
g.add_edge('Lille', 'Arras', weight=52.7)
g.add_edge('Paris', 'Amiens', weight=144.4)
g.add_edge('Paris', 'Arras', weight=185.8)
g.add_edge('Amiens', 'Arras', weight=62.6)

draw_weighted_graph(g)
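
The definitions of $V$, $E$ and $w$ can also be written down directly, without a graph library. The sketch below stores the same weighted graph as a vertex list and a dictionary of edge weights, then derives $n$, $m$ and a symmetric weight matrix. The names `vertices`, `weights` and `W` are introduced here for illustration, and a zero entry in `W` is used to mean "no edge", which is an assumption of this sketch.

```python
import numpy as np

# Vertex set V
vertices = ['Lille', 'Paris', 'Amiens', 'Arras']

# Edge set E with the weight function w : E -> R
weights = {('Lille', 'Paris'): 225,
           ('Lille', 'Amiens'): 62.7,
           ('Lille', 'Arras'): 52.7,
           ('Paris', 'Amiens'): 144.4,
           ('Paris', 'Arras'): 185.8,
           ('Amiens', 'Arras'): 62.6}

n = len(vertices)  # order of the graph, |V|
m = len(weights)   # number of edges, |E|

# Weight matrix of the undirected graph: W[i, j] = W[j, i] = w({i, j})
index = {v: k for k, v in enumerate(vertices)}
W = np.zeros((n, n))
for (u, v), w in weights.items():
    W[index[u], index[v]] = w
    W[index[v], index[u]] = w

print(n, m)
print(W)
```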