[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nilsjennissen/machine-learning/blob/main/notebooks/overview.ipynb)

# Data Science

Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It encompasses various fields and techniques, which can be grouped into the following categories:

1. Data Collection and Preprocessing:

- Data Acquisition: Collecting data from various sources like databases, APIs, web scraping, and IoT devices.
- Data Cleaning: Handling missing values, removing duplicates, and correcting inconsistencies.
- Data Transformation: Converting data into a suitable format for analysis, for example by normalizing numeric values or encoding categorical variables.

2. Data Exploration and Visualization:

- Descriptive Statistics: Summarizing data using measures like mean, median, mode, and standard deviation.
- Data Visualization: Representing data using plots and charts, such as bar charts, line charts, and scatter plots.
- Exploratory Data Analysis (EDA): Identifying patterns, trends, and relationships in the data.

3. Feature Engineering and Selection:

- Feature Engineering: Creating new features from existing data to improve model performance.
- Feature Selection: Identifying the most important features for a given problem, using techniques like correlation analysis, recursive feature elimination, and LASSO regularization.

4. Predictive Modeling and Machine Learning:

- Supervised Learning: Training models to predict outcomes based on labeled data, using algorithms like linear regression, logistic regression, and support vector machines.
- Unsupervised Learning: Identifying patterns in data without labeled outcomes, using algorithms like clustering and dimensionality reduction.
- Reinforcement Learning: Training models to make decisions based on rewards and penalties, using algorithms like Q-learning and deep reinforcement learning.

5. Model Evaluation and Validation:

- Performance Metrics: Assessing model performance using metrics like accuracy, precision, recall, F1-score, and mean squared error.
- Cross-Validation: Estimating model performance on unseen data by repeatedly splitting the dataset into training and validation folds and averaging the scores across folds.
- Hyperparameter Tuning: Optimizing model performance by adjusting algorithm-specific parameters, using techniques like grid search and random search.

6. Deployment and Maintenance:

- Model Deployment: Integrating trained models into production systems, using tools like REST APIs and cloud-based services.
- Model Monitoring: Tracking model performance over time and updating models as needed.
- Data Governance: Ensuring data quality, security, and compliance with regulations.
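
Cross-validation, listed under model evaluation above, is easier to follow with a small worked example. The sketch below uses only the Python standard library and a deliberately trivial "model" that predicts the mean of the training targets. The function names `k_fold_indices` and `cross_val_mse` and the toy data are illustrative assumptions, not part of this notebook. In practice the data would usually be shuffled first and a library such as scikit-learn would handle the splitting.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


def cross_val_mse(y, k):
    """Average validation MSE of a mean-predictor baseline over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(y), k):
        # "Training" the baseline just means taking the mean of the training targets.
        prediction = mean(y[i] for i in train_idx)
        # Evaluate on the held-out fold only.
        scores.append(mean((y[i] - prediction) ** 2 for i in val_idx))
    return mean(scores)


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # toy targets, hypothetical data
folds = list(k_fold_indices(len(y), k=3))
score = cross_val_mse(y, k=3)
print(folds[0])  # first fold: train on indices 2..5, validate on 0..1
print(score)
```

Every sample is used for validation exactly once, which is what distinguishes k-fold cross-validation from a single train/test split.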

# Machine Learning
Visual overview of machine learning techniques

**Supervised Learning**

**Unsupervised Learning**

**Semi-supervised Learning**

**Reinforcement Learning**

In [8]:
# Let's create a visual overview and a small database of the techniques

# Start with one dictionary of techniques per learning paradigm

In [13]:
# Dictionaries of machine learning techniques, one per learning paradigm
# Supervised learning techniques, grouped by task
super_dict = {
    'regression': ['linear regression', 'decision tree', 'random forest', 'gradient boosting', 'neural network'],
    'classification': ['logistic regression', 'decision tree', 'random forest', 'gradient boosting', 'neural network']
}

unsuper_dict = {
    'clustering': ['k-means', 'hierarchical', 'dbscan'],
    'dimensionality reduction': ['pca', 't-sne', 'autoencoders']
}

semi_super_dict = {
    'semi-supervised learning': ['label propagation', 'label spreading']
}

reinforcement_dict = {
    'reinforcement learning': ['q-learning', 'deep q-learning']
}
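
One way to use these dictionaries as a small lookup "database" is to invert them, so that each technique maps to the places it appears. The sketch below repeats two of the dictionaries so that it runs on its own. The names `ml_overview`, `index_by_technique`, and `technique_index` are hypothetical helpers introduced for illustration.

```python
from collections import defaultdict

super_dict = {
    'regression': ['linear regression', 'decision tree', 'random forest', 'gradient boosting', 'neural network'],
    'classification': ['logistic regression', 'decision tree', 'random forest', 'gradient boosting', 'neural network'],
}
unsuper_dict = {
    'clustering': ['k-means', 'hierarchical', 'dbscan'],
    'dimensionality reduction': ['pca', 't-sne', 'autoencoders'],
}

# Combine the per-paradigm dictionaries into one nested structure.
ml_overview = {
    'supervised learning': super_dict,
    'unsupervised learning': unsuper_dict,
}


def index_by_technique(overview):
    """Map each technique name to the (paradigm, task) pairs it is listed under."""
    index = defaultdict(list)
    for paradigm, tasks in overview.items():
        for task, techniques in tasks.items():
            for technique in techniques:
                index[technique].append((paradigm, task))
    return dict(index)


technique_index = index_by_technique(ml_overview)

# Techniques listed under more than one task.
shared = sorted(name for name, places in technique_index.items() if len(places) > 1)
print(shared)
```

The `shared` list shows that a technique name alone does not identify a unique position in the hierarchy, which matters whenever names are used as unique keys.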

In [15]:
# Visualize the machine learning techniques as a hierarchical tree
# using the treelib package

# import the Tree class from treelib
from treelib import Tree

# create a tree object
tree = Tree()

# build the tree by hand, one node at a time
# add the root node and the two learning paradigms below it
tree.create_node('machine learning', 'machine learning')
tree.create_node('supervised learning', 'supervised learning', parent='machine learning')
tree.create_node('unsupervised learning', 'unsupervised learning', parent='machine learning')

# add the supervised learning tasks
tree.create_node('regression', 'regression', parent='supervised learning')
tree.create_node('classification', 'classification', parent='supervised learning')

# add the unsupervised learning tasks
tree.create_node('clustering', 'clustering', parent='unsupervised learning')
tree.create_node('dimensionality reduction', 'dimensionality reduction', parent='unsupervised learning')

# add the regression techniques
tree.create_node('linear regression', 'linear regression', parent='regression')
tree.create_node('decision tree', 'decision tree', parent='regression')
tree.create_node('random forest', 'random forest', parent='regression')

# add the classification techniques; 'decision tree' and 'random forest' are
# already used as node IDs under 'regression', and node IDs must be unique
tree.create_node('logistic regression', 'logistic regression', parent='classification')

# add the clustering techniques
tree.create_node('k-means', 'k-means', parent='clustering')
tree.create_node('hierarchical', 'hierarchical', parent='clustering')
tree.create_node('dbscan', 'dbscan', parent='clustering')

# show the tree
tree.show()

machine learning
├── supervised learning
│   ├── classification
│   │   └── logistic regression
│   └── regression
│       ├── decision tree
│       ├── linear regression
│       └── random forest
└── unsupervised learning
    ├── clustering
    │   ├── dbscan
    │   ├── hierarchical
    │   └── k-means
    └── dimensionality reduction



In [19]:
from treelib import Tree

# Create a nested dictionary of machine learning techniques with examples
ml_techniques = {
    'supervised learning': {
        'regression': ['linear regression', 'decision tree', 'random forest'],
        'classification': ['logistic regression']
    },
    'unsupervised learning': {
        'clustering': ['k-means', 'hierarchical', 'dbscan'],
        'dimensionality reduction': []
    },
    'reinforcement learning': ['Q-learning', 'Deep Q-network (DQN)'],
    'semi-supervised learning': ['self-training', 'label propagation'],
    'deep learning': {
        'neural networks': ['convolutional neural networks (CNN)', 'recurrent neural networks (RNN)'],
        'transformers': ['BERT', 'GPT']
    }
}

# Create a tree object
tree = Tree()

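# Define a recursive function to populate the tree
# Note: techniques in a list are attached directly to `parent`, so a key whose
# value is a list (e.g. 'regression') does not get a node of its own
def populate_tree(node, parent):
    for key, value in node.items():
        if isinstance(value, dict):
            # Nested dict: create a node for the key, then recurse into it
            child_node = tree.create_node(key, key, parent=parent)
            populate_tree(value, child_node.identifier)
        elif isinstance(value, list):
            # List of techniques: add each one as a leaf under `parent`
            for technique in value:
                tree.create_node(technique, technique, parent=parent)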

# Add the root node separately
tree.create_node('machine learning', 'machine_learning')

# Populate the tree
populate_tree(ml_techniques, 'machine_learning')

# Show the tree
tree.show()


machine learning
├── Deep Q-network (DQN)
├── Q-learning
├── deep learning
│   ├── BERT
│   ├── GPT
│   ├── convolutional neural networks (CNN)
│   └── recurrent neural networks (RNN)
├── label propagation
├── self-training
├── supervised learning
│   ├── decision tree
│   ├── linear regression
│   ├── logistic regression
│   └── random forest
└── unsupervised learning
    ├── dbscan
    ├── hierarchical
    └── k-means
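
In the output above, category labels such as 'regression' and 'reinforcement learning' are missing, because the techniques in each list are attached to the parent node rather than to a node for their own key. As a point of comparison, here is a minimal sketch that renders a nested dict as a text tree without treelib and keeps every key as its own level. The function name `render` and the small `example` dict are assumptions made for illustration. Unlike the treelib output above, entries appear in insertion order rather than sorted.

```python
def render(node, prefix=''):
    """Return the text lines for the children of `node` (a dict or a list)."""
    # A dict maps labels to sub-structures; a list holds leaf labels only.
    if isinstance(node, dict):
        items = list(node.items())
    else:
        items = [(name, None) for name in node]

    lines = []
    for i, (label, child) in enumerate(items):
        last = i == len(items) - 1
        lines.append(prefix + ('└── ' if last else '├── ') + label)
        if child:  # recurse into non-empty dicts and lists
            lines.extend(render(child, prefix + ('    ' if last else '│   ')))
    return lines


example = {
    'supervised learning': {
        'regression': ['linear regression', 'decision tree'],
        'classification': ['logistic regression', 'decision tree'],
    },
    'reinforcement learning': ['Q-learning'],
}

lines = ['machine learning'] + render(example)
print('\n'.join(lines))
```

Because this version never assigns identifiers, the same technique name can appear under several categories without conflict.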



In [2]:
from treelib import Tree

# Create a nested dictionary of machine learning techniques with examples
ml_techniques = {
    'supervised learning': {
        'regression': [
            'linear regression',
            'polynomial regression',
            'ridge regression',
            'lasso regression',
            'decision tree regression',
            'random forest regression',
            'gradient boosting regression',
            'support vector regression',
            'k-nearest neighbors regression',
            'neural network regression'
        ],
        'classification': [
            'logistic regression',
            'support vector machines',
            'k-nearest neighbors classification',
            'decision tree classification',
            'random forest classification',
            'gradient boosting classification',
            'naive Bayes',
            'neural network classification',
            'ensemble methods',
            'nearest centroid'
        ]
    },
    'unsupervised learning': {
        'clustering': [
            'k-means',
            'hierarchical',
            'DBSCAN',
            'Gaussian mixture models',
            'spectral clustering',
            'meanshift',
            'OPTICS',
            'BIRCH',
            'agglomerative clustering',
            'density-based clustering'
        ],
        'dimensionality reduction': [
            'principal component analysis (PCA)',
            'linear discriminant analysis (LDA)',
            't-distributed stochastic neighbor embedding (t-SNE)',
            'autoencoders',
            'non-negative matrix factorization (NMF)',
            'independent component analysis (ICA)',
            'random projections',
            'feature selection',
            'canonical correlation analysis (CCA)',
            'factor analysis'
        ]
    },
    'reinforcement learning': [
        'Q-learning',
        'SARSA',
        'Deep Q-network (DQN)',
        'Policy gradient methods',
        'Actor-Critic methods',
        'Monte Carlo Tree Search (MCTS)',
        'Temporal Difference (TD) learning',
        'Proximal Policy Optimization (PPO)',
        'Asynchronous Advantage Actor-Critic (A3C)',
        'Deep Deterministic Policy Gradient (DDPG)'
    ],
    'semi-supervised learning': [
        'self-training',
        'label propagation',
        'generative models',
        'co-training',
        'multi-view learning',
        'transductive support vector machines',
        'graph-based methods',
        'entropy regularization',
        'consistency regularization',
        'self-ensembling'
    ],
    'deep learning': [
        'convolutional neural networks (CNN)',
        'recurrent neural networks (RNN)',
        'long short-term memory (LSTM)',
        'gated recurrent unit (GRU)',
        'autoencoders',
        'generative adversarial networks (GANs)',
        'variational autoencoders (VAEs)',
        'transformer networks',
        'attention mechanisms',
        'capsule networks',
        'BERT',
        'GPT'
    ],
    'natural language processing': [
        'tokenization',
        'lemmatization',
        'stemming',
        'stop word removal',
        'part-of-speech tagging',
        'named entity recognition',
        'sentiment analysis',
        'topic modeling',
        'word embeddings',
        'language modeling'
    ]
}

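# Create a tree object
tree = Tree()

# Define a recursive function to populate the tree
def populate_tree(node, parent):
    if isinstance(node, dict):
        # Each key becomes a node; its value is added below it
        for key, value in node.items():
            child_node = tree.create_node(key, key, parent=parent)
            populate_tree(value, child_node.identifier)
    elif isinstance(node, list):
        # Each technique becomes a leaf whose ID is its name; IDs must be
        # unique, so a name listed under two categories raises an error
        for technique in node:
            tree.create_node(technique, technique, parent=parent)

# Add the root node separately
tree.create_node('machine learning', 'machine_learning')

# Populate the tree
populate_tree(ml_techniques, 'machine_learning')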



DuplicatedNodeIdError: Can't create node with ID 'autoencoders'

In [23]:
# Show tree
tree.show()

Tree is empty

