New! My Open Source Google Colab Notebooks on this topic:
Converting various common Matlab statistical and machine learning tools into Python.
- Motivation and Background
- Personal Motivation
- Origin and History of MATLAB vs. New Paradigm
- A Word on Matlab Python Translators
- Assumptions on Machine Used & Tool Prerequisites
- Command Line Familiarity
- Versioning (Old Notice)
- How I Put This Guide together
- Installation Requirements to Run on Local server
- Installation of libraries
- Installation of Prerequisites
- Data Sets - Generating, Displaying, Reading
- Data Encoding & Preprocessing
- Tools Used In Preprocessing
- 01.1 Generating Data, Declaring variables
- 01.2 Reading Data
- 02 Numeric Types
- 03 Matrix Work
- 04 Plotting
- 05 Performance
- 06 Functions
- 07 Classes
- 08 Tables
- 09 Cells
- 10 Control Statements
- 11 Set Operations
- 12 Casting
- 13 Strings
- 14 Operators
- 15 Functions
- 16 Lambda
- 17 Inheritance
- 18 Scope
- 19 Modules
- 20 Dates
- 30 Containers
- 31 Machine Learning Intro
- Tools Used In Preprocessing
- Two Types of Supervised Learning Tasks
- Regression Using Parametric Modeling
- Classification Using Parametric Modeling
- K-Nearest Neighbor Regression
- Piecewise Linear Regression
This repo is created for those coming from a Matlab background who are interested in being able to do some of the same types of statistical computing and machine learning in Python.
My personal motivation for this project comes from the fact that I studied electrical engineering, which involved the use of Matlab. As computing has inevitably become more powerful over the years, the use of terms such as "Machine Learning" and "Artificial Intelligence" has become democratized. Yet fundamentally, all these terms really amount to is the ability to do math at volume.
The "at volume" side of that statement has to do with the capability to deploy software onto a server.
Matlab was never really geared toward being re-deployable, i.e. working in a "production" environment.
- Academic, highly-maintained, highly structured.
- Designed with the PC era in mind which means individual contributors, focusing on mathematical models.
- It's designed for the speed of the individual data scientist to be able to move quickly, without code getting in the way.
- Does not allow for cheap deployment and reproducibility, because you have to buy a license for each distribution.
- This is in contrast to Python and open source data science tools, which, while clunkier in terms of setup and tooling, can be used indefinitely without the need for licensing.
- Some of the best, most interesting mathematical practitioners out there are trained in and fluent in Matlab, but are not familiar with Python and its various tools.
It is my hope that individuals may be able to use this repo as a guide to translate from MATLAB to Python, and that it assists their careers and work.
Some Matlab-to-Python translators exist; here are some that I have been able to find:
- https://github.com/victorlei/smop Appears to be continuously updated, high reputation score at 380+ stars.
- https://github.com/awesomebytes/libermate
- https://github.com/miaoever/Mat2py
- https://github.com/buguen/mat2py
Translators are all well and good; however, my view is that a fundamental understanding of the underlying computing, and of why certain technical decisions are made, is superior to simply plugging code into a translator. That said, the above resources are provided depending upon your use case.
https://sebastianraschka.com/blog/2014/matrix_cheatsheet_table.html
http://mathesaurus.sourceforge.net/matlab-numpy.html
https://www.reddit.com/r/MachineLearning/comments/5x9tyi/r_which_libraries_to_learn_to_get_started_with_ml/ https://www.reddit.com/r/MachineLearning/comments/6h75h2/d_random_question_why_has_matlaboctave_not_been/
"Big Data" is a fairly innocuous term, but generally it may be reasonable to define it as meaning "files that do not fit into available memory." One potential criticism of Matlab is that it does not have the capability to expand beyond a single computer, and hence is not suitable for Big Data. Of course, Matlab is actually an ecosystem and is constantly be developed by the Mathworks company.
In short, there are a wide variety of data types and storage systems that Matlab can handle. This guide is not meant to supplant Matlab, but rather to create understanding of the differences between the two languages and available toolsets/libraries.
Some people may find Matlab useful, some people may find Python useful, for different types of projects.
I used a Macbook Air 2015 for all of this, running OSX. There may be some slight variations in tools and techniques used if you are on a different operating system.
This is important if you're going to build an app based upon this code, since most web apps are built on Linux distributions. If you're not familiar with them, I recommend checking out containers for deploying and distributing to systems other than the personal computers and laptops that many working within the Matlab environment may start off with.
In short, MATLAB is essentially a complete package with a bunch of libraries and ways of accessing data installed.
Moving to Python means you are working "completely from scratch" and importing a set of tools which give you Matlab-like functionality. To ensure that this works across multiple systems, you use containers to ensure consistency.
Aside from "on-machine" methods of deploying Python, there are also notebooks, such as Jupyter notebooks, which - once a container like the one mentioned above is installed - provide a fast and easy way to write and annotate code and pseudocode while playing around.
The next level up from this would be online versions of Notebooks, such as:
- Microsoft Azure Notebooks - Jupyter Notebooks
- Google Colab
- ...And other alternatives which you could look for yourself!
All of the commands and ssh work mentioned were done using the command line (terminal). You will need basic familiarity with a command terminal for the following.
This may be old information for many, but I'm putting this here just in case...
- If you're not familiar with Python or various open source computing languages, you should know that there are various versions of Python, the most common of which are Python 2 and Python 3.
- Python 2 is being retired permanently in 2020 and will no longer be maintained, so it is recommended to use Python 3 and not Python 2 at all.
Basically, I have a high familiarity with MATLAB, having worked with it for years on a wide variety of data science projects, including image processing. Typically I would learn about Matlab, and indeed new forms of mathematics I had not heard of, by reading the Matlab documentation and user forums on Mathworks. Having taken an undergraduate class back in 2006 taught by Vladimir Cherkassky, I compiled some of the thoughts from the book:
Then, I researched various machine learning tools embedded within Python by reading various "top data science toolkits for python" articles such as this one.
Over time I built in additional fundamentals through some of my own, separate, closed source programming in Python as I went along.
I also got recommendations for toolkits to look into from various friends who work in data science, including Mike Semeniuk.
Requirements:
- python3
- Numpy - a library within Python that allows for the matrix-style manipulation of numbers inherent in Matlab.
- SciPy - a library within Python that provides various types of linear algebra and scientific computing routines, built on top of compiled C and Fortran code.
- Scikit-Learn
- Pandas - the Python Data Analysis library (the name derives from "panel data"), which provides some of the tabular data structure functionality of Matlab.
- Matplotlib - a plotting library within Python that allows you to make graphs.
- PostgreSQL - a simple, widely known relational database. PostgreSQL Installation Documentation
- MySQL - a widely known relational database. Guide to Install MySQL with Homebrew; Download Community Version
- pip3 (the Python 3 version of pip).
- Homebrew.
- Anaconda & setup of Python environments.
- Jupyter notebook and use of proper environments - particularly the python3 environment.
You must install the following from the command line using the following commands:
- Python3 documentation
- Numpy documentation pip3 install numpy
- Pandas documentation pip3 install pandas
- Using Python 3.6, set up an Anaconda environment for Python3:
  pip install numpy
  pip install matplotlib
  sudo ipython -m pip install mpld3 # install javascript display for matplotlib charts
- Use Jupyter Notebook with the Python3 Anaconda Environment.
Using Matlab 2017, Student Version.
We can generate data sets to play around with using random number functions. This allows us to quickly and easily generate dummy data for the purposes of doing statistical analysis or various data science exercises.
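As a sketch of what generating dummy data looks like on the Python side (the variable names here are illustrative, not from the repo's own files), NumPy's random number generator can stand in for Matlab's `randi` and `randn`:

```python
import numpy as np

# Seed the generator so the "random" data is reproducible
rng = np.random.default_rng(seed=42)

# 100 integers between 1 and 100, analogous to Matlab's randi([1, 100], 100, 1)
x = rng.integers(low=1, high=101, size=100)

# 100 samples from a standard normal, analogous to Matlab's randn(100, 1)
noise = rng.standard_normal(100)

print(x.min(), x.max())  # values stay within [1, 100]
```

Seeding is optional, but it makes exercises repeatable across runs, which helps when comparing results against a Matlab script.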
Reading from a CSV file is a simple, well-known way of grabbing data for analysis purposes within the Matlab world. Examples of how to do this are shown in:
1.2.1 readcsv
This file allows you to read a CSV file of arbitrary length.
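A minimal sketch of reading a CSV in Python with Pandas (the sample data below is made up for illustration; `read_csv` accepts a file path in exactly the same way as the in-memory buffer used here):

```python
import io
import pandas as pd

# Stand-in for an on-disk CSV file of arbitrary length
csv_text = "x,y\n1,2.5\n2,4.1\n3,6.0\n"

# read_csv works the same on a path ("data.csv") or a file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)        # (3, 2): three rows, two columns
print(df["y"].mean())  # column access by name, like a Matlab table
```

This is roughly the analog of Matlab's `readtable`, returning a labeled, column-typed structure rather than a bare matrix.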
Using a database server is the more established way of working with data in a "server type" environment. Here we show an example using Postgres, since it is a well-known, easy-to-use database in web and cloud settings.
To get a PostGres Database set up locally on your machine, you can simply use the installation from a command line:
(using HomeBrew https://brew.sh/)
$ brew update
$ brew doctor
$ brew install postgres
To manipulate a postgres database, you can use a simple free program known as Postico.
To start a sample database on your machine, you can use the command:
$ brew services start postgresql
The word "start" is the operative word here. You can also use "restart" to restart the service.
After you have started a server, you can connect to it via Postico.
- Create a new Database - matlab-Python
- Create a new Table - sample
- Decide columns, names and types. (datatypes here https://www.w3schools.com/sql/sql_datatypes.asp)
- Edit content directly within Postico.
Analysis of the speed of insertion into a PostgreSQL database: https://alliedtesting.github.io/pgmex-blog/2017/06/29/performance-comparison-of-postgresql-connectors-in-matlab-part-I/
sqlwrite https://www.mathworks.com/help/database/ug/database.odbc.connection.sqlwrite.html
In order to connect to this database from Matlab, you have to buy the Matlab Database Toolbox. We're not going to go through how to use it here.
We use the most popular PostgreSQL Python module, psycopg - http://initd.org/psycopg/download/ Download and install as shown:
$ pip install -U pip # make sure to have an up-to-date pip
$ pip install psycopg2
To install the Python3 version, use:
$ pip3 install psycopg2
Within Machine Learning, a lot of the "basic computational stuff" one learns in a beginner computer science class - meaning control statements such as if statements, for loops, and while loops, as well as data types, functions, arrays, and classes: the bread and butter that might normally be defined as "software programming" - might be referred to by a Machine Learning enthusiast as "Data Encoding and Preprocessing."
A Machine Learning practitioner may define input and output values in a few different ways:
- Numeric variables, which have a value such as speed, mass, or volume, or...
- Categorical variables, which are neither ordered nor defined by distance, such as A, B, Left, Right, Orange, Blue, Nice, Mean.
- There is a third type known as an ordered categorical variable, which has order but not distance, such as A1, A2, A3 or Fast, Medium, Slow.
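The three variable types above can be sketched in Pandas (the column names and values here are purely illustrative):

```python
import pandas as pd

# Illustrative data: one numeric and one unordered categorical variable
df = pd.DataFrame({
    "speed": [10.0, 22.5, 17.1],          # numeric variable
    "color": ["Orange", "Blue", "Orange"],  # categorical variable
})

# One-hot encode the unordered categorical column for use in a model
encoded = pd.get_dummies(df, columns=["color"])
print(list(encoded.columns))  # speed plus one indicator column per color

# An ordered categorical keeps order but has no notion of distance
sizes = pd.Categorical(["Fast", "Slow", "Medium"],
                       categories=["Slow", "Medium", "Fast"],
                       ordered=True)
print(sizes.ordered)
```

One-hot encoding is one common choice for unordered categories; ordered categoricals can instead be mapped to integer ranks, depending on the model.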
Preprocessing includes scaling numerical data and ensuring that categorical variables are uniform. For numeric inputs and outputs, this may include simple statistical analysis in order to better understand the data - removing outliers, for example. It may also involve scaling, for example scaling all data into the range from 0 to 1.
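A minimal sketch of those two preprocessing steps with NumPy - the data, the 1.5-standard-deviation cutoff, and the variable names are all illustrative assumptions, not a universal recipe:

```python
import numpy as np

# Illustrative raw numeric data with an obvious outlier
data = np.array([2.0, 4.0, 6.0, 8.0, 100.0])

# Simple outlier removal: drop points more than 1.5 std devs from the mean
mask = np.abs(data - data.mean()) <= 1.5 * data.std()
cleaned = data[mask]  # the 100.0 is dropped

# Min-max scaling of the remaining data into the range [0, 1]
scaled = (cleaned - cleaned.min()) / (cleaned.max() - cleaned.min())
print(scaled)
```

Note the order matters: scaling before removing the outlier would squash all the legitimate points into a narrow band near zero.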
Preprocessing tools are organized by folders in this repo. We are basically just going straight down the line in terms of letting individuals know what types of tools are available and what the analogs between Python and MATLAB are.
- 1.1.1 "generate" shows how to generate random numbers, between 1 and 100.
- Completed for Python
- Completed for MATLAB
- 1.1.2 "writecsv" shows how to write this data into a CSV file.
- Completed for Python
- 1.2.1 "readcsv" shows how to read from a csv file, then put on a scatter plot
- Completed for Python
- Completed for MATLAB
Comparing numeric types, whether integers, doubles, etc. These types are best described in the Numpy documentation.
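A small sketch of how NumPy handles numeric types (the arrays here are illustrative): Matlab defaults everything to double, while NumPy infers a dtype from the data and promotes when types mix.

```python
import numpy as np

a = np.array([1, 2, 3])        # integer array (dtype inferred)
b = np.array([1.0, 2.0, 3.0])  # float64, the analog of Matlab's double

print(a.dtype, b.dtype)

# Mixing integer and float promotes to the wider type, as in Matlab arithmetic
c = a + b
print(c.dtype)  # float64
```

You can also force a dtype explicitly, e.g. `np.array([1, 2, 3], dtype=np.float64)`, to match Matlab's default behavior.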
Python does not really have "Matrices" as defined in MATLAB (whose name is short for "Matrix Laboratory"). Discussion of some analogs follows.
We wrote some examples using Numpy. Numpy is good for arrays, providing what is known as the "n-dimensional array," or "ndarray." While Matlab arrays are immediately understood to be numerical arrays, an "array" within Python is more like a regular "computer programming" array that contains a bunch of values, but is not necessarily optimized or set up well for doing math. That's what the Numpy library is for. Numpy's "ndarray" type has easily accessible values and all sorts of other features much more similar to how Matlab functions.
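The difference is easy to see in a couple of lines (an illustrative sketch):

```python
import numpy as np

py_list = [1, 2, 3]
nd = np.array([1, 2, 3])

# A plain Python list concatenates with +; it does not do math
print(py_list + py_list)  # [1, 2, 3, 1, 2, 3]

# An ndarray does elementwise math, like a Matlab array
print(nd + nd)   # [2 4 6]
print(nd * 2)    # [2 4 6]
```

This is the core reason Numpy is imported at the top of nearly every Python data science script: it swaps in Matlab-style numerical semantics.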
Pandas is a separate library, which works in conjunction with Numpy, but allows one to much more easily work with the whole "Matrix" concept. Conceptually, Pandas is more like a set of workbook spreadsheets, with cells that cannot be merged - so a bit more like structs in Matlab.
This seems to be, at the time of writing, the crux of the difference between Matlab and Python - while Python, or Python/Pandas, is not immediately set up with the assumption that you are going to do math, it does give the ability to work with what in Matlab would be known as structs or cells.
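A quick sketch of the struct-like feel of a Pandas DataFrame (the column names and data are illustrative):

```python
import pandas as pd

# Named columns with mixed types per column, like a Matlab struct array
people = pd.DataFrame({
    "name": ["Ada", "Grace"],
    "score": [91.5, 88.0],
})

# Dot notation, roughly like struct field access in Matlab
print(people.score.mean())  # 89.75

# Bracket notation works too, and is safer for arbitrary column names
print(people["name"].tolist())
```

Unlike a Matlab matrix, the `name` column holds strings and the `score` column holds doubles side by side, which is exactly the heterogeneity structs and cells exist for.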
Within Matlab, a matrix must be homogeneous in terms of the type of data within it - which hypothetically could reduce errors. In Python, you may need to write tests to ensure that the data you are using is consistent across values. Matlab will immediately test A*B and tell you whether it will work and, if not, why not, from a numerical-type standpoint. Python may raise a bunch of other errors that you have to read through and deal with.
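For the matrix-multiplication case specifically, Numpy does give Matlab-like dimension checking via the `@` operator (a sketch with made-up matrices):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])     # 2x2
B = np.array([[5.0],
              [6.0]])          # 2x1

# Matrix multiplication: inner dimensions must agree, as with Matlab's A*B
C = A @ B                      # 2x2 times 2x1 -> 2x1
print(C.shape)

# A shape mismatch raises a ValueError you can catch and inspect
try:
    A @ np.ones((3, 1))        # 2x2 times 3x1: invalid
except ValueError as err:
    print("mismatch:", err)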
Structs and cells take a bit more coding and thinking to work with, but not a whole lot - and they give more flexibility, with the capability to use dot notation to apply functions to them. While matrices in Matlab are super easy to just get going with, leaving "less in the way" between the user and the math, there are trade-offs when you need more heterogeneous data structures.
- Matrix Sizing
Not Done
Not Done
Not Done
Not Done
Not Done
Not Done
Not Done
Not Done
Not Done
Not Done
Not Done
Not Done
Supervised learning is when the outputs of a function include direct, ground-truth values, i.e. the answers for the output are given. The "supervision" comes from the fact that those output values "supervise" the algorithm to make sure it is performing well, in a sense - hence the name "supervised learning."
- Regression - where real valued function estimates are given, and the performance of a prediction is measured usually by the square of a loss.
- Classification - such as binary classification, where an indicator separates a space into two regions and classifies the data points. Performance of the prediction in the case of binary classification is usually measured as either "right or wrong" for any given point. Loss is calculated as "actual" vs. predicted in this case - so it's all very application dependent.
Loss functions should not be blindly adopted for all applications.
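The two loss types above can be computed in a few lines of NumPy (the true/predicted values here are made up for illustration):

```python
import numpy as np

# Regression: squared loss averaged over points (mean squared error)
y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5,  0.0, 2.0])
mse = np.mean((y_true - y_pred) ** 2)
print(mse)

# Binary classification: 0/1 "right or wrong" loss per point,
# averaged into a misclassification rate
labels = np.array([1, 0, 1, 1])
guesses = np.array([1, 1, 1, 0])
error_rate = np.mean(labels != guesses)
print(error_rate)  # 2 wrong out of 4 -> 0.5
```

Squared loss penalizes large regression errors disproportionately, while 0/1 loss treats every misclassification equally - one concrete reason a loss function should match the application rather than be adopted blindly.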
Not Done