## Fundamentals of Machine Learning (SOMINDW07)
## Lukas Edman & Joren Arkes

---

#### Original source code by: Mike Zhang, Johannes Bjerva and Malvina Nissim
#### Current maintainer: Joren Arkes

---

This notebook provides a preconfigured working environment for simple machine learning experiments. You are, of course, free to set up the code on your own computer as well. You can run each code cell using `Ctrl+Enter` or by clicking the `Run cell` button. When you load this notebook in a fresh Colab workspace, make sure to first run all the code cells under *Preparing our environment* so that everything is set up correctly.

# Preparing our environment

We start by moving our environment to the correct Colab working directory. Then we clone the source code for this course using `git`:

https://github.com/jorenarkes/foml


In [None]:
%cd /content/
!git clone https://github.com/jorenarkes/foml.git

We move our environment into the folder that we just downloaded.

In [None]:
%cd foml

We use the `pip` package manager to install the Python libraries that the code needs.

In [None]:
!pip install matplotlib numpy pandas scikit-learn

# Uploading your own CSV files

The code cell below connects your own Google Drive folder to this notebook runtime. You can use this to load your own CSV datasets.

In most circumstances, your main Drive folder can be accessed under `/content/drive/MyDrive`. If you click this link, you should be able to see your personal Drive files in the side panel.


In [None]:
from google.colab import drive

drive.mount("/content/drive")
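
Before pointing an experiment at your own file, it can help to check that the CSV has the columns the program expects. The sketch below is only an illustration: the helper `missing_columns` is hypothetical and not part of the course code, and it assumes a comma-separated file with a header row. The `label` column name comes from the configuration cell further down in this notebook.

```python
import csv


def missing_columns(csv_path, features, label="label"):
    """Return the required column names that are absent from the CSV header.

    Hypothetical helper, not part of run_experiment.py. Assumes a
    comma-separated file whose first row contains the column names.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Read only the first row; an empty file yields an empty header.
        header = next(csv.reader(handle), [])
    required = list(features) + [label]
    return [name for name in required if name not in header]
```

For example, `missing_columns("/content/drive/MyDrive/my_file.csv", ["text-cat"])` would return an empty list if the file has both a `text-cat` and a `label` column.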

# Running a complete experiment

We run the Python script with command-line arguments, which you can change in the cell below. Note that you have to re-run that cell for any changes to take effect.

In [None]:
####  Ignore this: we clear these variables before every run.
%env PLOT=
%env CM=
%env NORMALIZE=
%env TEST=
####

# The location of the input CSV file. If you want to use your own file from Drive, change this to something like /content/drive/MyDrive/my_file.csv
%env CSV_FILE=/content/foml/data/hyperp_subset.csv

# Select the algorithm that you want to use. Valid options are:

# nb - Multinomial Naive Bayes
# dt - Decision Tree
# svm - Support Vector Machine
# knn - K-nearest Neighbours

# You can select multiple algorithms at a time by separating them with a space, e.g. `nb knn`.
# Note that if you want to configure some parameters for Decision Trees or KNN, you have to do that using other parameters. Ask in class if you need help!

%env ALGORITHMS=nb

# Enter the names of the columns with features here, separated by a space.
# Note that the program automatically detects and uses the `label` column as the target class.
%env FEATURES=text-cat

# You can change the split (train/dev) here
%env SPLIT=70 30

# Change the length of word-level n-grams to encode here
%env NWORDS=1

# Change the length of character-level n-grams to encode here
%env NCHARS=0

# Uncomment below to save a .png plot of the Confusion Matrix
# These plots are saved in the `foml/plot_images` folder
#%env PLOT=--plot

# Uncomment below to print a text-based Confusion Matrix
#%env CM=--cm

# Uncomment below to normalize the plots
#%env NORMALIZE=--norm

# Uncomment below to run the classifier on the test set, instead of the dev set.
#%env TEST=--test

In [None]:
!python run_experiment.py --csv $CSV_FILE --algorithms $ALGORITHMS --features $FEATURES --split $SPLIT --nwords=$NWORDS --nchars=$NCHARS $PLOT $CM $NORMALIZE $TEST
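
To make the `NWORDS` and `NCHARS` settings more concrete, here is a minimal sketch of what word-level and character-level n-grams of a given length look like. The helper names `word_ngrams` and `char_ngrams` are hypothetical and not part of `run_experiment.py`, which may tokenize and encode text differently. Treating a length of 0 as "no n-grams of this kind" is also an assumption, based only on the default `NCHARS=0` above.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive whitespace-separated tokens."""
    if n <= 0:
        # Assumption: a length of 0 means this feature type is switched off.
        return []
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return every run of n consecutive characters, spaces included."""
    if n <= 0:
        return []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(word_ngrams("the cat sat", 2))  # ['the cat', 'cat sat']
print(char_ngrams("cat", 2))          # ['ca', 'at']
```

With length 1, word-level n-grams are simply the individual words, which matches the default `NWORDS=1`.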

## Some more examples

#### Show the help info for this program, which lists a wide range of configuration options. You don't have to use the above cells to run an experiment: you can also choose your own command-line parameters!

In [None]:
!python run_experiment.py --help
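
If you prefer to assemble the command in Python rather than through `%env` variables, one possible approach is sketched below. The function `build_command` is a hypothetical helper, not part of the course code. It only uses the flags that appear in the run cell above; check the `--help` output for the authoritative list of options.

```python
import shlex


def build_command(csv_file, algorithms, features, split=(70, 30),
                  nwords=1, nchars=0, plot=False, cm=False,
                  norm=False, test=False):
    """Build the argument list for run_experiment.py (hypothetical helper)."""
    cmd = [
        "python", "run_experiment.py",
        "--csv", csv_file,
        "--algorithms", *algorithms,
        "--features", *features,
        "--split", *[str(part) for part in split],
        f"--nwords={nwords}",
        f"--nchars={nchars}",
    ]
    # Optional switches are appended only when requested.
    for enabled, flag in [(plot, "--plot"), (cm, "--cm"),
                          (norm, "--norm"), (test, "--test")]:
        if enabled:
            cmd.append(flag)
    return cmd


command = build_command(
    "/content/foml/data/hyperp_subset.csv",
    algorithms=["nb", "knn"],
    features=["text-cat"],
    cm=True,
)
# Print the command as a single, shell-quoted line for inspection.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from inside the `foml` folder, or the printed line can be pasted after a `!` in a code cell.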