# An introduction to Machine Learning


## Purpose of the course

## What we will see

### Some terminology

## Some references

# Setting up the environment

## Installing Miniconda Python

Miniconda is a free minimal installer for conda. It is a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, including pip, zlib and a few others. It does not ship any ML libraries; use the `conda install` command to install 720+ additional conda packages from the Anaconda repository.

Install scripts are available for Linux, Windows, and macOS. Since most of us use a Mac, I will only show you how to install on this OS. Instructions are available for other platforms [here](https://conda.io/projects/conda/en/latest/user-guide/install/index.html).

### Get the installer
The installer for macOS can be downloaded from [here](https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh).

Then you just need to run it:
```bash
$ bash Miniconda3-latest-MacOSX-x86_64.sh
```

You can specify the installation directory if you wish. The default is `~/miniconda3`, that is, `/Users/$USER/miniconda3` on macOS.

# Some important libraries  (*there is a module for that...*)

## `numpy`

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

 - a powerful N-dimensional array object;
 - broadcasting functions;
 - tools for integrating C/C++ and Fortran code;
 - useful linear algebra, Fourier transform, and random number capabilities.

NumPy can be used as an efficient multi-dimensional container of generic data, and arbitrary data types can be defined. Because the low-level numerical routines in NumPy are implemented in C, vectorized operations on whole arrays are typically much faster than the equivalent Python loops.
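As a small illustration of the array object and of broadcasting, here is a minimal sketch. The array names and values are made up for the example.

```python
import numpy as np

# A 2-D array (3 rows, 4 columns) and a 1-D array with one value per column.
matrix = np.arange(12).reshape(3, 4)
col_offsets = np.array([10, 20, 30, 40])

# Broadcasting: the 1-D array is added to every row, with no Python loop.
shifted = matrix + col_offsets

# Reductions along an axis are also vectorized.
col_means = matrix.mean(axis=0)

print(shifted)
print(col_means)
```

The shapes `(3, 4)` and `(4,)` are compatible because broadcasting aligns trailing dimensions, so the offsets are applied column by column.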

## `pandas`

`pandas` is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.
`pandas` uses `numpy` under the hood, so data operations are fast as long as you use its vectorized methods rather than explicit Python loops.

`pandas` is going to be the base of our data handling, and we'll get to know it intimately.
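To give a first taste, the sketch below builds a small `DataFrame`, filters it and aggregates it by group. The column names and values are toy data invented for the example.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame(
    {
        "item": ["dress", "dress", "gloves", "gloves", "boots"],
        "material": ["cotton", "wool", "wool", "leather", "rubber"],
        "price": [40.0, 60.0, 15.0, 25.0, 50.0],
    }
)

# Boolean filtering: keep only the rows whose material is wool.
wool_items = df[df["material"] == "wool"]

# Group-wise aggregation: mean price per item type.
mean_price = df.groupby("item")["price"].mean()

print(wool_items)
print(mean_price)
```

Both operations are vectorized: no row-by-row loop is written in Python.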

## `matplotlib` and `seaborn`

Libraries to generate beautiful plots and graphs. We will use these to view our data.

## `scipy`

The SciPy ecosystem is built on top of Python and NumPy.

It includes:

 - The SciPy library, a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics, and much more.
 - Matplotlib, a mature and popular plotting package that provides publication-quality 2-D plotting, as well as rudimentary 3-D plotting.

On this base, the SciPy ecosystem includes general and specialised tools for data management and computation, productive experimentation, and high-performance computing.
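As a quick sketch of what the SciPy library offers, the example below uses two of its toolboxes, `scipy.optimize` and `scipy.integrate`. The functions being minimised and integrated are arbitrary choices for the illustration.

```python
import numpy as np
from scipy import integrate, optimize

# Minimise a simple parabola; its minimum is at x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2 + 1.0)

# Numerically integrate sin(x) between 0 and pi; the exact value is 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

print(result.x, area)
```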

## `scikit-learn`

An efficient library for data analysis and predictive modeling. It provides access to a plethora of Machine Learning algorithms through a simple and consistent API.
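Most `scikit-learn` estimators follow the same pattern: create the model, call `fit`, then call `predict` or `score`. The sketch below shows it on a synthetic dataset. The choice of `LogisticRegression` and the dataset sizes are arbitrary, picked only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (not real data).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The same three steps apply to most estimators: construct, fit, predict.
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Mean accuracy on the held-out samples.
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Swapping in another classifier usually means changing only the line that constructs `model`.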

## `statsmodels`

Statistical modeling and hypothesis testing.

-------------

## Deep Learning toolkits

### `TensorFlow`

TensorFlow is an open source platform for Machine Learning developed by Google.

### `Keras`


### `PyTorch`

<img src="https://www.dataiku.com/static/img/learn/guide/getting-started/getting-started-with-python/logo-stack-python.png">

# Some basic syntax

## `pandas` - data manipulation

## `scikit-learn` - ML algorithms

## `keras` - deep learning models

In [8]:
import re

In [56]:
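from typing import List, Tuple

# Raw strings avoid invalid-escape warnings for \d and \s
percent_regex = r"\d{1,3}\s?%\s?[a-zA-Z]+"
composition_regex = r"(self|lining|contrast)"


def get_materials(text: str) -> Tuple[List[str], List[str]]:
    """Return the "NN% material" strings and the component names found in `text`."""
    lowered = text.lower()
    materials = re.findall(percent_regex, lowered)

    # Normalise spacing to "NN% material"; needs filter on accepted materials
    clean = [i.replace(" ", "").replace("%", "% ") for i in materials]
    components = re.findall(composition_regex, lowered)
    return clean, components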

In [57]:
s = """
    <ul>
    <li>Available in Medium Wash</li>
    <li>Denim Jumpsuit</li>
    <li>Adjustable Straps</li>
    <li>Lined Bust</li>
    <li>Back Zipper</li>
    <li>Stretch</li>
    <li>Self: 45% Cotton 30%Rayon 23% Polyester 2% Spandex </li>
    <li>Lining: 100% Polyester </li>
    <li>Contrast: 50% Polyester 50% cotton</li>
    </ul><br>
"""

In [58]:
get_materials(s)

(['45% cotton',
  '30% rayon',
  '23% polyester',
  '2% spandex',
  '100% polyester',
  '50% polyester',
  '50% cotton'],
 ['self', 'lining', 'contrast'])

In [59]:
x, y = get_materials(s)

In [65]:
import numpy as np

In [75]:
# Keep the materials up to the first point where the percentages add up to 100
x[:np.argwhere(np.cumsum([int(i.split("%")[0]) for i in x]) == 100)[0][0] + 1]

['45% cotton', '30% rayon', '23% polyester', '2% spandex']
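The cumulative-sum trick above only recovers the first composition. A possible generalisation, sketched below, cuts the flat list every time the running total reaches 100 and pairs each group with a component name. The helper `split_compositions` is hypothetical (it is not part of the original notebook), and it assumes the components appear in the same order as their material groups.

```python
from typing import Dict, List


def split_compositions(materials: List[str], components: List[str]) -> Dict[str, List[str]]:
    """Group "NN% material" strings into blocks that each sum to 100%."""
    groups: List[List[str]] = []
    current: List[str] = []
    running_total = 0

    for material in materials:
        current.append(material)
        running_total += int(material.split("%")[0])
        if running_total == 100:
            # A full composition has been collected: start a new group.
            groups.append(current)
            current, running_total = [], 0

    # zip stops at the shorter input, so trailing materials that never
    # reach 100% are dropped.
    return dict(zip(components, groups))


materials = ["45% cotton", "30% rayon", "23% polyester", "2% spandex",
             "100% polyester", "50% polyester", "50% cotton"]
components = ["self", "lining", "contrast"]
compositions = split_compositions(materials, components)
print(compositions)
```

This sketch does not handle percentages that overshoot 100 or components listed without a composition; those cases would need explicit validation.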

In [76]:
import pandas as pd
import json

In [94]:
df = pd.read_csv("/Users/pmascolo/Downloads/hs6 -_ customs_description - official-hs6.csv").dropna(
    subset=["customs_description"], axis=0
).fillna("")

In [95]:
df["item"] = df["item"].str.strip()

In [105]:
import collections

allowed_labels = collections.defaultdict(dict)

for k in ("material", "construction", "gender"):
    groups = df.groupby("item")[k].unique()
    
    for item, v in groups.items():  # Series.iteritems() was removed in pandas 2.0
        allowed_labels[item][k] = v

In [109]:
dicts = []

for k, v in allowed_labels.items():
    v["item_type"] = k
    dicts.append(v)

In [113]:
pd.DataFrame(dicts).set_index("item_type").to_json("allowed_labels.json", orient="index")
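An alternative way to build the same kind of mapping is sketched below: a single dictionary comprehension over the groups, with plain Python lists so that the result can go straight to `json.dumps`. Since the CSV is not available here, the sketch runs on a toy frame whose values are invented; only the column names come from the cells above.

```python
import json

import pandas as pd

# Toy stand-in for the customs table; the values are invented.
toy = pd.DataFrame(
    {
        "item": ["dress", "dress", "boots"],
        "material": ["cotton", "wool", "rubber"],
        "construction": ["knitted", "woven", ""],
        "gender": ["women", "women", ""],
    }
)

label_columns = ("material", "construction", "gender")

# One entry per item type, each holding the sorted unique labels per column.
allowed = {
    item: {col: sorted(str(v) for v in group[col].unique()) for col in label_columns}
    for item, group in toy.groupby("item")
}

as_json = json.dumps(allowed, indent=2)
print(as_json)
```

Converting the unique values to sorted lists of `str` keeps the output deterministic and JSON-serialisable without going through a second `DataFrame`.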

In [104]:
# df.groupby("item")["material"].unique().index

for item, materials in df.groupby("item")["material"].unique().items():
    print(item, materials)

 ['textile']
anorak ['wool' 'cotton' 'manmade fiber' 'textile']
bag ['textile' 'cotton' '' 'synthetic']
bathrobe ['cotton' 'textile' 'manmade fiber']
bed_net ['material_irrelevant']
bedspread ['textile']
blanket ['textile' 'wool' 'cotton' 'synthetic']
boots ['rubber']
bra ['textile']
camping goods ['textile']
cloth ['textile']
clothing_accessories ['']
coin ['']
corset ['textile']
cufflinks ['']
curtains ['synthetic' 'textile' 'cotton']
diamonds ['']
diamonds powder ['']
dress ['wool' 'cotton' 'synthetic' 'artificial fiber' 'textile'
 'artificial fibers']
ensemble ['cotton' 'synthetic' 'textile' 'wool']
footwear ['waterproof' 'rubber' 'leather' 'textile' '']
footwear part ['']
furnishing_article ['']
garment ['cotton' 'synthetic' 'textile' 'rubberised' 'manmade fiber' 'felt']
garment_parts ['material_irrelevant' '']
girdle ['textile']
gloves ['synthetic' 'wool' 'cotton' 'textile' 'material_irrelevant']
gold ['']
gold powder ['']
handkerchief ['cotton' 'textile']
hosiery ['textile']
int

In [None]:
# Pseudocode: probability that p_N (current_variant) is better than all other variants.
# Every variant has 2 associated parameters, alpha and beta (successes, failures).


# comparing current_variant vs. all the rest
P = 1                                               # probability of current_variant being better than rest
p_N = Beta(α_current_variant, β_current_variant)    # pdf of current_variant

for variant in other_variants:
    # account for all other variants
    # removing the probability of other variants being better than current_variant
    P -= probability_a_better_b(variant, current_variant)
    

# extra spurious term coming from the maths
extra_term = 1
for variant in variants:
    partial_sum = 0
    for i in range(α_variant):                      # i = 0 .. α_variant - 1
        numerator = (p_N ** i) * (1 - p_N) ** β_variant
        denominator = (β_variant + i) * Beta(1 + i, β_variant)   # here Beta is the Beta function, not the pdf
        partial_sum += numerator / denominator

    extra_term *= partial_sum

    
# I really do not trust the sign here...
# This term should be fairly small, but I'll need to dig deeper into it
if len(variants) % 2 == 0:
    P += extra_term
else:
    P -= extra_term
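Since the sign of the extra term is in doubt, one way to sanity-check the result is to estimate the same probability by sampling. The sketch below is not the derivation above: it combines a closed-form expression for two variants (a known result for Beta posteriors with an integer `alpha`) with a Monte Carlo estimate for any number of variants. The function signatures are assumptions (here `probability_a_better_b` takes the four parameters directly rather than variant objects), and the counts are made up.

```python
import math
from typing import List, Tuple

import numpy as np


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def probability_a_better_b(alpha_a: int, beta_a: float, alpha_b: float, beta_b: float) -> float:
    """P(p_a > p_b) for p_a ~ Beta(alpha_a, beta_a), p_b ~ Beta(alpha_b, beta_b).

    Closed form valid when alpha_a is a positive integer.
    """
    total = 0.0
    for i in range(alpha_a):
        total += math.exp(
            log_beta(alpha_b + i, beta_a + beta_b)
            - math.log(beta_a + i)
            - log_beta(1 + i, beta_a)
            - log_beta(alpha_b, beta_b)
        )
    return total


def probability_best(params: List[Tuple[float, float]], n_samples: int = 200_000, seed: int = 0) -> np.ndarray:
    """Monte Carlo estimate of P(variant k has the highest rate) for every k."""
    rng = np.random.default_rng(seed)
    alphas = np.array([a for a, _ in params], dtype=float)
    betas = np.array([b for _, b in params], dtype=float)
    # One row per draw, one column per variant.
    samples = rng.beta(alphas, betas, size=(n_samples, len(params)))
    winners = samples.argmax(axis=1)
    return np.bincount(winners, minlength=len(params)) / n_samples


# Made-up (alpha, beta) = (successes, failures) pairs for three variants.
variants = [(120, 880), (100, 900), (90, 910)]
print(probability_best(variants))
print(probability_a_better_b(*variants[0], *variants[1]))
```

With only two variants the two functions should agree up to sampling noise, which gives a reference value to compare against when checking the sign of `extra_term` in the multi-variant formula.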