Awesome Machine Learning On Source Code


Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science is related to data mining, deep learning and big data.

Data science is a "concept to unify statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, and information science. Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational, and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge.


Early usage

In 1962, John Tukey described a field he called “data analysis,” which resembles modern data science. Later, attendees at a 1992 statistics symposium at the University of Montpellier II acknowledged the emergence of a new discipline focused on data of various origins and forms, combining established concepts and principles of statistics and data analysis with computing.

The term "data science" has been traced back to 1974, when Peter Naur proposed it as an alternative name for computer science. In 1996, the International Federation of Classification Societies became the first conference to specifically feature data science as a topic. However, the definition was still in flux. In 1997, C.F. Jeff Wu suggested that statistics should be renamed data science. He reasoned that a new name would help statistics shed inaccurate stereotypes, such as being synonymous with accounting, or limited to describing data. In 1998, Chikio Hayashi argued for data science as a new, interdisciplinary concept, with three aspects: data design, collection, and analysis.

During the 1990s, popular terms for the process of finding patterns in datasets (which were increasingly large) included “knowledge discovery” and "data mining."

Modern usage

The modern conception of data science as an independent discipline is sometimes attributed to William S. Cleveland. In a 2001 paper, he advocated an expansion of statistics beyond theory into technical areas; because this would significantly change the field, it warranted a new name. "Data science" became more widely used in the next few years: in 2002, the Committee on Data for Science and Technology launched Data Science Journal. In 2003, Columbia University launched The Journal of Data Science. In 2014, the American Statistical Association's Section on Statistical Learning and Data Mining changed its name to the Section on Statistical Learning and Data Science, reflecting the ascendant popularity of data science.

The professional title of "data scientist" has been attributed to DJ Patil and Jeff Hammerbacher in 2008. Though it was used by the National Science Board in their 2005 report, "Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century," it referred broadly to any key role in managing a digital data collection.

There is still no consensus on the definition of data science and it is considered by some to be a buzzword.

Careers in data science

Data science is a growing field. A career as a data scientist was ranked as the third best job in America for 2020 by Glassdoor, and was ranked the number one best job from 2016 to 2019. Data scientists have a median salary of $118,370 per year, or $56.91 per hour. Job growth in this field is also above average, with a projected increase of 16% from 2018 to 2028. The largest employer of data scientists in the US is the federal government, employing 28% of the data science workforce. Other large employers of data scientists are computer systems design services, research and development laboratories, and colleges and universities. Typically, data scientists work full time, and some work more than 40 hours a week.

Educational path

Becoming a data scientist requires a significant amount of education and experience. The first step is to earn a bachelor's degree, typically in a field related to computing or mathematics. Coding bootcamps are also available and can serve as an alternative pre-qualification to supplement a bachelor's degree in another field. Most data scientists also complete a master's degree or a PhD in data science. Once these qualifications are met, the next step is to apply for an entry-level job in the field. Some data scientists may later choose to specialize in a sub-field of data science.

Specializations and associated careers

  • Machine Learning Scientist: Machine learning scientists research new methods of data analysis and create algorithms.
  • Data Analyst: Data analysts utilize large data sets to gather information that meets their company’s needs.
  • Data Consultant: Data consultants work with businesses to determine the best usage of the information yielded from data analysis.
  • Data Architect: Data architects build data solutions that are optimized for performance and design applications.
  • Applications Architect: Applications architects track how applications are used throughout a business and how they interact with users and other applications.

Impacts of data science

Big data is very quickly becoming a vital tool for businesses and companies of all sizes. The availability and interpretation of big data has altered the business models of old industries and enabled the creation of new ones. Data-driven businesses are worth $1.2 trillion collectively in 2020, an increase from $333 billion in the year 2015. Data scientists are responsible for breaking down big data into usable information and creating software and algorithms that help companies and organizations determine optimal operations. As big data continues to have a major impact on the world, data science does as well due to the close relationship between the two.


16 Important Data Science Papers:


Machine Learning Formulas:


Best Reference Books - Database Concepts and System:


Top 9 Data Science Algorithms:


Top 11 Data Structure Books:


Books and articles about Flowcharts:


Lecture Notes:

  • Introduction, linear classification, perceptron update rule (PDF) (a short illustrative sketch of the update rule appears after this list)
  • Perceptron convergence, generalization (PDF)
  • Maximum margin classification (PDF)
  • Classification errors, regularization, logistic regression (PDF)
  • Linear regression, estimator bias and variance, active learning (PDF)
  • Active learning (cont.), non-linear predictions, kernels (PDF)
  • Kernel regression, kernels (PDF)
  • Support vector machine (SVM) and kernels, kernel optimization (PDF)
  • Model selection (PDF)
  • Model selection criteria (PDF)
  • Description length, feature selection (PDF)
  • Combining classifiers, boosting (PDF)
  • Boosting, margin, and complexity (PDF)
  • Margin and generalization, mixture models (PDF)
  • Mixtures and the expectation maximization (EM) algorithm (PDF)
  • EM, regularization, clustering (PDF)
  • Clustering (PDF)
  • Spectral clustering, Markov models (PDF)
  • Hidden Markov models (HMMs) (PDF)
  • HMMs (cont.) (PDF)
  • Bayesian networks (PDF)
  • Learning Bayesian networks (PDF)
  • Probabilistic inference - Guest lecture on collaborative filtering (PDF)
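
As a companion to the first lecture note above (linear classification and the perceptron update rule), here is a minimal sketch of the mistake-driven perceptron update on a toy, linearly separable dataset; the data, initialization, and stopping rule are illustrative choices, not taken from the notes.

```python
# Minimal perceptron update rule on a toy 2-D dataset (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data: label +1 if x1 + x2 > 0, else -1.
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

w = np.zeros(2)
b = 0.0

for epoch in range(10):
    mistakes = 0
    for xi, yi in zip(X, y):
        # Mistake-driven update: only change w and b when the sign is wrong.
        if yi * (np.dot(w, xi) + b) <= 0:
            w += yi * xi
            b += yi
            mistakes += 1
    if mistakes == 0:  # converged: every training point is classified correctly
        break

print("weights:", w, "bias:", b, "epochs used:", epoch + 1)
```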

Books:


22 Algorithms Books Every Programmer Should Read:


Assignments:


Bayes' Theorem – The Forecasting Pillar of Data Science:
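
For quick reference, Bayes' theorem states P(A|B) = P(B|A) P(A) / P(B). The short sketch below works through a standard diagnostic-test style example; the 1% prevalence, 95% sensitivity, and 10% false-positive rate are made-up numbers for illustration only.

```python
# Worked Bayes' theorem example with made-up numbers (illustrative only).
p_disease = 0.01             # prior, P(disease)
p_pos_given_disease = 0.95   # sensitivity, P(positive | disease)
p_pos_given_healthy = 0.10   # false-positive rate, P(positive | no disease)

# Total probability of a positive test, P(positive).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")  # ≈ 0.088
```

Even with a fairly accurate test, the low prior keeps the posterior small, which is the usual lesson of this example.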


Essential Math for Data Science:


Data Science Case Studies:


Data Science Tutorials for Beginners:


Lecture Notes by Andrew Ng:


50 selected papers in Data Mining and Machine Learning:

General

Data Mining and Statistics: What’s the Connection?

Data Mining: Statistics and More?, D. Hand, American Statistician, 52(2):112-118.

Data Mining, G. Weiss and B. Davison, in Handbook of Technology Management, John Wiley and Sons, expected 2010.

From Data Mining to Knowledge Discovery in Databases, U. Fayyad, G. Piatetsky-Shapiro & P. Smyth, AI Magazine, 17(3):37-54, Fall 1996.

Mining Business Databases, Communications of the ACM, 39(11): 42-48.

10 Challenging Problems in Data Mining Research, Q. Yang and X. Wu, International Journal of Information Technology & Decision Making, Vol. 5, No. 4, 2006, 597-604.


General Data Mining Methods and Algorithms

Top 10 Algorithms in Data Mining, X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, Z. Zhou, M. Steinbach, D. J. Hand, D. Steinberg, Knowl Inf Syst (2008) 14:1-37.

Induction of Decision Trees, R. Quinlan, Machine Learning, 1(1):81-106, 1986.


Web and Link Mining

The PageRank Citation Ranking: Bringing Order to the Web, L. Page, S. Brin, R. Motwani, T. Winograd, Technical Report, Stanford University, 1999.

The Structure and Function of Complex Networks, M. E. J. Newman, SIAM Review, 2003, 45, 167-256.

Link Mining: A New Data Mining Challenge, L. Getoor, SIGKDD Explorations, 2003, 5(1), 84-89.

Link Mining: A Survey, L. Getoor, SIGKDD Explorations, 2005, 7(2), 3-12.

Semi-supervised Learning

Semi-Supervised Learning Literature Survey, X. Zhu, Computer Sciences TR 1530, University of Wisconsin — Madison.

Learning with Labeled and Unlabeled Data, M. Seeger, University of Edinburgh (unpublished), 2002.

Person Identification in Webcam Images: An Application of Semi-Supervised Learning, M. Balcan, A. Blum, P. Choi, J. Lafferty, B. Pantano, M. Rwebangira, X. Zhu, Proceedings of the 22nd ICML Workshop on Learning with Partially Classified Training Data, 2005.

Learning from Labeled and Unlabeled Data: An Empirical Study across Techniques and Domains, N. Chawla, G. Karakoulas, Journal of Artificial Intelligence Research, 23:331-366, 2005.

Text Classification from Labeled and Unlabeled Documents using EM, K. Nigam, A. McCallum, S. Thrun, T. Mitchell, Machine Learning, 39, 103-134, 2000.

Self-taught Learning: Transfer Learning from Unlabeled Data, R. Raina, A. Battle, H. Lee, B. Packer, A. Ng, in Proceedings of the 24th International Conference on Machine Learning, 2007.

An iterative algorithm for extending learners to a semisupervised setting, M. Culp, G. Michailidis, 2007 Joint Statistical Meetings (JSM), 2007

Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers, V. Sheng, F. Provost, P. Ipeirotis, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008.

Logistic Regression for Partial Labels, in 9th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Volume III, pp. 1935-1941, 2002.

Classification with Partial labels, N. Nguyen, R. Caruana, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008.

Induction of Decision Trees from Partially Classified Data Using Belief Functions, M. Bjanger, Norwegian University of Science and Technology, 2000.

Knowledge Discovery in Large Image Databases: Dealing with Uncertainties in Ground Truth, P. Smyth, M. Burl, U. Fayyad, P. Perona, KDD Workshop 1994, AAAI Technical Report WS-94-03, pp. 109-120, 1994.


Active Learning

Improving Generalization with Active Learning, D. Cohn, L. Atlas, and R. Ladner, Machine Learning 15(2), 201-221, May 1994.

On Active Learning for Data Acquisition, Z. Zheng and B. Padmanabhan, In Proc. of IEEE Intl. Conf. on Data Mining, 2002.

Active Sampling for Class Probability Estimation and Ranking, M. Saar-Tsechansky and F. Provost, Machine Learning 54:2 2004, 153-178.

The Learning-Curve Sampling Method Applied to Model-Based Clustering, C. Meek, B. Thiesson, and D. Heckerman, Journal of Machine Learning Research 2:397-418, 2002.

Active Sampling for Feature Selection, S. Veeramachaneni and P. Avesani, Third IEEE Conference on Data Mining, 2003.

Heterogeneous Uncertainty Sampling for Supervised Learning, D. Lewis and J. Catlett, In Proceedings of the 11th International Conference on Machine Learning, 148-156, 1994.

Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction, G. Weiss and F. Provost, Journal of Artificial Intelligence Research, 19:315-354, 2003.

Active Learning using Adaptive Resampling, KDD 2000, 91-98.


Cost-Sensitive Learning

Types of Cost in Inductive Concept Learning, P. Turney, In Proceedings Workshop on Cost-Sensitive Learning at the Seventeenth International Conference on Machine Learning.

Toward Scalable Learning with Non-Uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection, P. Chan and S. Stolfo, KDD 1998.


Papers

Learning when Data Sets are Imbalanced and When Costs are Unequal and Unknown, M. Maloof, in ICML Workshop on Learning from Imbalanced Datasets II, 2003.

Uncertainty Sampling Methods for One-class Classifiers, P. Juszcak and R. Duin, in ICML Workshop on Learning from Imbalanced Datasets II, 2003.

C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling, C. Drummond and R. Holte, in ICML Workshop on Learning from Imbalanced Datasets II, 2003.

C4.5 and Imbalanced Data sets: Investigating the effect of sampling method, probabilistic estimate, and decision tree structure, N. Chawla, in ICML Workshop on Learning from Imbalanced Datasets II, 2003.

Wrapper-based Computation and Evaluation of Sampling Methods for Imbalanced Datasets, N. Chawla, L. Hall, and A. Joshi, in Proceedings of the 1st International Workshop on Utility-based Data Mining, 24-33, 2005.

Learning from Little: Comparison of Classifiers Given Little Training, G. Forman and I. Cohen, in 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, 161-172, 2004.

A Multiple Resampling Method for Learning from Imbalanced Data Sets, A. Estabrooks, T. Jo, and N. Japkowicz, in Computational Intelligence, 20(1), 2004.

A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data, G. Batista, R. Prati, and M. Monard, SIGKDD Explorations, 6(1):20-29, 2004.

Class Imbalance versus Small Disjuncts, T. Jo and N. Japkowicz, SIGKDD Explorations, 6(1): 40-49, 2004.

Extreme Re-balancing for SVMs: a Case Study, B. Raskutti and A. Kowalczyk, SIGKDD Explorations, 6(1):60-69, 2004.

Generative Oversampling for Mining Imbalanced Datasets, A. Liu, J. Ghosh, and C. Martin, Third International Conference on Data Mining (DMIN-07), 66-72.

Computing Machinery and Intelligence

Class Imbalances: Are we Focusing on the Right Issue?, N. Japkowicz, in ICML Workshop on Learning from Imbalanced Datasets II, 2003.


Recommender Systems

Trust No One: Evaluating Trust-based Filtering for Recommenders, J. O’Donovan and B. Smyth, In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI-05), 2005, 1663-1665.

Trust in Recommender Systems, J. O'Donovan and B. Smyth, In Proceedings of the 10th International Conference on Intelligent User Interfaces (IUI-05), 2005, 167-174.


10 Cutting Edge Research-Papers In Computer Vision and Image Generation:


21 hottest research papers on Computer Vision and Machine Learning:


10 Important AI Research Papers:


6 Top NLP Papers:


The 18 Best Books About AI:


Top 30 most influential papers in the world of big data:


Top Papers on Clustering Algorithms:


5 Latest Research Papers On ML You Must Read:


Readings in Databases:


The 5 Best Data Science Books for Non-Techies:


Five Must-Read Statistics Books to Become a Successful Data Analyst:


Deep Learning Papers:


Research Papers on Programming Languages:


Numenta Research Papers:


Awesome Machine Learning Papers:


3D Detection Papers:

The papers in this list are about 3D detection and semantic segmentation for autonomous vehicles, especially methods that use point clouds and deep learning.



Ten Trending Academic Papers on the Future of Computer Vision:


Must-Read Papers on GANs:

Generative Adversarial Networks (GANs) are one of the most interesting and popular applications of deep learning. Here is a list of 10 papers on GANs that will give you a great introduction to the topic as well as a foundation for understanding the state of the art.
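
Before diving into the papers, here is a minimal, illustrative sketch of the adversarial training loop they build on, assuming PyTorch; the toy 1-D "real" data distribution, network sizes, and hyperparameters are arbitrary demonstration choices, not taken from any of the listed papers.

```python
# Minimal GAN training-loop sketch on toy 1-D data (illustrative only).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Generator maps noise to a fake sample; discriminator scores real vs. fake.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = 0.5 * torch.randn(64, 1) + 3.0  # toy "real" data: N(3, 0.5^2)
    z = torch.randn(64, 8)
    fake = G(z)

    # Discriminator step: push real samples toward label 1, fakes toward 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make the discriminator label fakes as real.
    g_loss = bce(D(G(z)), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# The mean of generated samples should drift toward 3.0 as training progresses.
print("generated mean:", G(torch.randn(1000, 8)).mean().item())
```

The listed papers refine exactly this basic loop with better losses, architectures, and training stabilization.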



5 Must-read Papers on Product Categorization for Data Scientists:


Must read research papers on Data Structures:


Key Papers in Deep RL:



Model-Free RL:

Deep Q-Learning



Policy Gradients



Deterministic Policy Gradients



Distributional RL



Policy Gradients with Action-Dependent Baselines



Path-Consistency Learning



Other Directions for Combining Policy-Learning and Q-Learning



Evolutionary Algorithms



Exploration:

Intrinsic Motivation



Unsupervised RL



Transfer and Multitask RL:


Hierarchy:


Memory:


Model-Based RL:

Model is Learned



Model is Given



Meta-RL:


Scaling RL:


RL in the Real World:


Safety:


Imitation Learning and Inverse Reinforcement Learning:


Reproducibility, Analysis, and Critique:


Bonus: Classic Papers in RL Theory or Review:


14 NLP Research Breakthroughs You Can Apply To Your Business:


Most Downloaded Artificial Intelligence Articles:


AI Papers and Notes:


Most Influential Data Science Research Papers:


Awesome Fraud Detection Research Papers:


Machine Learning Lectures:


Assignments:



The 5 Algorithms for Efficient Deep Learning Inference on Small Devices:

Pruning Neural Networks:
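
As context for the pruning papers collected under this heading, below is a minimal sketch of magnitude-based weight pruning, the simplest variant: weights with the smallest absolute values are zeroed out. The 50% sparsity target and the random stand-in weight matrix are illustrative assumptions.

```python
# Minimal magnitude-based pruning of a weight matrix (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 256))  # stand-in for a trained layer's weights
sparsity = 0.5                         # fraction of weights to remove

# Threshold below which a weight's magnitude is considered unimportant.
threshold = np.quantile(np.abs(weights), sparsity)
mask = np.abs(weights) >= threshold

pruned = weights * mask
print(f"kept {mask.mean():.1%} of weights; the rest are exactly zero")
```

In practice the pruned network is then fine-tuned with the zeroed weights held at zero, which is where most of the accuracy is recovered.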




Deep Compression:




Data Quantization:




Low-Rank Approximation:




Trained Ternary Quantization:




Neuro AI Papers:


Quantum ML Papers:


Healthcare ML Papers:


Human AI Interaction Papers:


Economics ML Papers:


Text Detection Papers:


Proteins ML Papers:


Genomics DL Papers:


Astronomy ML Papers:


Robust ML Papers:


Finance ML Papers:


Face Recognition Papers:


Timeline of Machine Learning:

1763 The Underpinnings of Bayes' Theorem Thomas Bayes's work An Essay towards solving a Problem in the Doctrine of Chances is published two years after his death, having been amended and edited by a friend of Bayes, Richard Price. The essay presents work which underpins Bayes' theorem.
1805 Least Squares Adrien-Marie Legendre describes the "méthode des moindres carrés", known in English as the least squares method. The least squares method is used widely in data fitting.
1812 Bayes' Theorem Pierre-Simon Laplace publishes Théorie Analytique des Probabilités, in which he expands upon the work of Bayes and defines what is now known as Bayes' Theorem.
1913 Markov Chains Andrey Markov first describes techniques he used to analyse a poem. The techniques later become known as Markov chains.
1950 Turing's Learning Machine Alan Turing proposes a 'learning machine' that could learn and become artificially intelligent. Turing's specific proposal foreshadows genetic algorithms.
1951 First Neural Network Machine Marvin Minsky and Dean Edmonds build the first neural network machine, able to learn, the SNARC.
1952 Machines Playing Checkers Arthur Samuel joins IBM's Poughkeepsie Laboratory and begins working on some of the very first machine learning programs, first creating programs that play checkers.
1957 Perceptron Frank Rosenblatt invents the perceptron while working at the Cornell Aeronautical Laboratory. The invention of the perceptron generated a great deal of excitement and was widely covered in the media.
1963 Machines Playing Tic-Tac-Toe Donald Michie creates a 'machine' consisting of 304 match boxes and beads, which uses reinforcement learning to play Tic-tac-toe (also known as noughts and crosses).
1967 Nearest Neighbor The nearest neighbor algorithm is created, marking the start of basic pattern recognition. The algorithm is used to map routes.
1969 Limitations of Neural Networks Marvin Minsky and Seymour Papert publish their book Perceptrons, describing some of the limitations of perceptrons and neural networks. The interpretation that the book shows neural networks to be fundamentally limited is seen as a hindrance for research into neural networks.
1970 Automatic Differentiation (Backpropagation) Seppo Linnainmaa publishes the general method for automatic differentiation (AD) of discrete connected networks of nested differentiable functions. This corresponds to the modern version of backpropagation, but is not yet named as such.
1979 Stanford Cart Students at Stanford University develop a cart that can navigate and avoid obstacles in a room.
1979 Neocognitron Kunihiko Fukushima first publishes his work on the neocognitron, a type of artificial neural network (ANN). The neocognitron later inspires convolutional neural networks (CNNs).
1981 Explanation Based Learning Gerald Dejong introduces Explanation Based Learning, where a computer algorithm analyses data and creates a general rule it can follow and discard unimportant data.
1982 Recurrent Neural Network John Hopfield popularizes Hopfield networks, a type of recurrent neural network that can serve as content-addressable memory systems.
1985 NetTalk Terry Sejnowski develops NetTalk, a program that learns to pronounce words the same way a baby does.
1986 Backpropagation Seppo Linnainmaa's reverse mode of automatic differentiation (first applied to neural networks by Paul Werbos) is used in experiments by David Rumelhart, Geoff Hinton and Ronald J. Williams to learn internal representations.
1989 Reinforcement Learning Christopher Watkins develops Q-learning, which greatly improves the practicality and feasibility of reinforcement learning.
1989 Commercialization of Machine Learning on Personal Computers Axcelis, Inc. releases Evolver, the first software package to commercialize the use of genetic algorithms on personal computers.
1992 Machines Playing Backgammon Gerald Tesauro develops TD-Gammon, a computer backgammon program that uses an artificial neural network trained using temporal-difference learning (hence the 'TD' in the name). TD-Gammon is able to rival, but not consistently surpass, the abilities of top human backgammon players.
1995 Random Forest Algorithm Tin Kam Ho publishes a paper describing random decision forests.
1995 Support Vector Machines Corinna Cortes and Vladimir Vapnik publish their work on support vector machines.
1997 IBM Deep Blue Beats Kasparov IBM's Deep Blue beats the world champion at chess.
1997 LSTM Sepp Hochreiter and Jürgen Schmidhuber invent long short-term memory (LSTM) recurrent neural networks, greatly improving the efficiency and practicality of recurrent neural networks.
1998 MNIST database A team led by Yann LeCun releases the MNIST database, a dataset comprising a mix of handwritten digits from American Census Bureau employees and American high school students. The MNIST database has since become a benchmark for evaluating handwriting recognition.
2002 Torch Machine Learning Library Torch, a software library for machine learning, is first released.
2006 The Netflix Prize The Netflix Prize competition is launched by Netflix. The aim of the competition was to use machine learning to beat Netflix's own recommendation software's accuracy in predicting a user's rating for a film given their ratings for previous films by at least 10%. The prize was won in 2009.
2009 ImageNet ImageNet is created. ImageNet is a large visual database envisioned by Fei-Fei Li from Stanford University, who realized that the best machine learning algorithms wouldn't work well if the data didn't reflect the real world. For many, ImageNet was the catalyst for the AI boom of the 21st century.
2010 Kaggle Competition Kaggle, a website that serves as a platform for machine learning competitions, is launched.
2010 Wall Street Journal Profiles Machine Learning Investing The WSJ profiles a new wave of machine-learning-based investing, focusing on RebellionResearch.com, which would later be the subject of author Scott Patterson's book, Dark Pools.
2011 Beating Humans in Jeopardy Using a combination of machine learning, natural language processing and information retrieval techniques, IBM's Watson beats two human champions in a Jeopardy! competition.
2012 Recognizing Cats on YouTube The Google Brain team, led by Andrew Ng and Jeff Dean, create a neural network that learns to recognize cats by watching unlabeled images taken from frames of YouTube videos.
2014 Leap in Face Recognition Facebook researchers publish their work on DeepFace, a system that uses neural networks to identify faces with 97.35% accuracy. The results are an improvement of more than 27% over previous systems and rival human performance.
2014 Sibyl Researchers from Google detail their work on Sibyl, a proprietary platform for massively parallel machine learning used internally by Google to make predictions about user behavior and provide recommendations.
2016 Beating Humans in Go Google's AlphaGo program becomes the first Computer Go program to beat an unhandicapped professional human player, using a combination of machine learning and tree search techniques. It is later improved as AlphaGo Zero and then, in 2017, generalized to chess and other two-player games as AlphaZero.