<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1">Introduction</a></span></li><li><span><a href="#Imports" data-toc-modified-id="Imports-2">Imports</a></span></li><li><span><a href="#Get-the-data" data-toc-modified-id="Get-the-data-3">Get the data</a></span></li><li><span><a href="#Generate-synthetic-data" data-toc-modified-id="Generate-synthetic-data-4">Generate synthetic data</a></span><ul class="toc-item"><li><span><a href="#via-GANs" data-toc-modified-id="via-GANs-4.1">via GANs</a></span></li><li><span><a href="#via-supervised-learning" data-toc-modified-id="via-supervised-learning-4.2">via supervised learning</a></span></li><li><span><a href="#via-unsupervised-learning" data-toc-modified-id="via-unsupervised-learning-4.3">via unsupervised learning</a></span></li></ul></li><li><span><a href="#References" data-toc-modified-id="References-5">References</a></span></li></ul></div>

# Introduction
<hr style = "border:2px solid black" ></hr>

<div class="alert alert-warning">
<font color=black>

**What?** Synthetic data generation

</font>
</div>

# Imports
<hr style = "border:2px solid black" ></hr>

In [None]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing, make_classification

# Optional dependencies for the "via GANs" section. Uncomment after installing
# ctgan, sdv and table_evaluator; the GAN cells fail with a NameError otherwise.
# These names match older releases of the packages; newer versions may differ.
#from ctgan import CTGANSynthesizer
#from sdv.evaluation import evaluate
#from table_evaluator import TableEvaluator

warnings.filterwarnings('ignore')

# Get the data
<hr style = "border:2px solid black" ></hr>

In [None]:
X, y = fetch_california_housing(return_X_y=True)

In [None]:
# Stack the features and the target side by side, then wrap them in a DataFrame
california_housing = np.column_stack([X, y])
california_housing_df = pd.DataFrame(california_housing)

In [None]:
california_housing_df.describe()
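
Because the DataFrame is built from a bare array, its columns are labeled `0` to `8`, which makes the `.describe()` output hard to read. The sketch below attaches the column names that scikit-learn documents for this dataset. `to_named_frame` is a hypothetical helper, and the demo arrays are random stand-ins with the same column layout, so the snippet runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Column names as documented for scikit-learn's California housing dataset
FEATURE_NAMES = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                 'Population', 'AveOccup', 'Latitude', 'Longitude']
TARGET_NAME = 'MedHouseVal'


def to_named_frame(X, y):
    """Hypothetical helper: combine features and target into a labeled DataFrame."""
    data = np.column_stack([X, y])
    return pd.DataFrame(data, columns=FEATURE_NAMES + [TARGET_NAME])


# Stand-in arrays (8 features + 1 target), NOT the real housing data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 8))
y_demo = rng.normal(size=5)

named_df = to_named_frame(X_demo, y_demo)
print(named_df.columns.tolist())
```

With the real data, `to_named_frame(X, y)` could replace the two-step construction above; the rest of the notebook works either way.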

# Generate synthetic data
<hr style = "border:2px solid black" ></hr>

<div class="alert alert-info">
<font color=black>

- The following methods will be used to generate synthetic data:
    - [x] via GANs
    - [x] via supervised learning
    - [x] via unsupervised learning
    - [ ] via HMM - Hidden Markov Model

- For each method the quality of the synthetic data will be evaluated in two ways:
    - [x] via pandas `.describe()`
    - [x] via sdv package - Synthetic Data Vault (SDV)

</font>
</div>
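
The `.describe()` check can be made more systematic by computing the relative gap between the summary statistics of the real and the synthetic table. The following is a minimal sketch, not the evaluation used by SDV: `describe_gap` is a hypothetical helper, the "real" table is toy data, and the synthesizer is a plain multivariate Gaussian baseline rather than a GAN.

```python
import numpy as np
import pandas as pd


def describe_gap(real_df, synth_df, stats=('mean', 'std', 'min', 'max')):
    """Hypothetical helper: relative gap between summary statistics.

    Returns one row per statistic and one column per feature, holding
    |synthetic - real| / (|real| + eps).
    """
    real = real_df.describe().loc[list(stats)]
    synth = synth_df.describe().loc[list(stats)]
    eps = 1e-12  # avoids division by zero
    return (synth - real).abs() / (real.abs() + eps)


# Toy "real" table with two correlated columns (illustrative data only)
rng = np.random.default_rng(42)
a = rng.normal(loc=5.0, scale=2.0, size=2000)
real_df = pd.DataFrame({'a': a, 'b': 3.0 * a + rng.normal(size=2000)})

# Baseline synthesizer: multivariate Gaussian with the real mean and covariance
synth_values = rng.multivariate_normal(real_df.mean().to_numpy(),
                                       real_df.cov().to_numpy(),
                                       size=len(real_df))
synth_df = pd.DataFrame(synth_values, columns=real_df.columns)

gap = describe_gap(real_df, synth_df)
print(gap.round(3))
```

Small gaps in `mean` and `std` indicate that the marginal distributions are roughly matched. The `min` and `max` rows are noisier because they depend on single extreme observations.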

## via GANs

In [None]:
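# Requires the optional ctgan import from the Imports section
ctgan = CTGANSynthesizer(epochs=10)
ctgan.fit(california_housing_df)
# Draw as many synthetic rows as there are real ones
synt_sample = ctgan.sample(len(california_housing_df))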

In [None]:
california_housing_df.describe()

In [None]:
synt_sample.describe()

In [None]:
evaluate(synt_sample, california_housing_df)

In [None]:

table_evaluator = TableEvaluator(california_housing_df, synt_sample)
table_evaluator.visual_evaluation()

## via supervised learning

In [None]:
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
from matplotlib import cm

In [None]:
X, y = make_regression(n_samples=1000, n_features=3, noise=0.2,
                       random_state=123)

plt.scatter(X[:, 0], X[:, 1], alpha=0.3, cmap='Greys', c=y)

In [None]:
plt.figure(figsize=(18, 18))

# One panel per noise level (standard deviation of the Gaussian noise added to y)
for k, noise in enumerate(range(10), start=1):
    X, y = make_regression(n_samples=100, n_features=3, noise=noise,
                           random_state=123)
    plt.subplot(5, 2, k)
    plt.scatter(X[:, 0], X[:, 1], alpha=0.3, cmap=cm.Greys, c=y)
    plt.title('Synthetic Data with Different Noise Levels: ' + str(noise))
plt.show()
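
The scatter plots show only two of the three features, so the effect of `noise` is hard to judge by eye. One way to quantify it, sketched here under the assumption that an ordinary least-squares fit is a fair yardstick, is to fit a linear model at each noise level and record its in-sample R². The noise levels are illustrative choices, not values from the cells above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

noise_levels = [0, 1, 10, 50, 100]  # illustrative values
r2_by_noise = {}

for noise in noise_levels:
    # Same seed each time, so only the noise scale changes between runs
    X_reg, y_reg = make_regression(n_samples=1000, n_features=3,
                                   noise=noise, random_state=123)
    model = LinearRegression().fit(X_reg, y_reg)
    r2_by_noise[noise] = model.score(X_reg, y_reg)  # in-sample R^2

for noise, r2 in r2_by_noise.items():
    print(f'noise={noise:>3}: R^2={r2:.4f}')
```

With `noise=0` the target is an exact linear function of the features, so the fit is essentially perfect; R² then falls as the noise standard deviation grows.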

In [None]:
plt.figure(figsize=(18, 18))
k = 0

for i in range(2, 6):
    X, y = make_classification(n_samples=100,
                               n_features=4,
                               n_classes=i,
                               n_redundant=0,
                               n_informative=4,
                               random_state=123)
    k+=1
    plt.subplot(2, 2, k)
    plt.scatter(X[:, 0], X[:, 1], alpha=0.8, cmap='gray', c=y)
    plt.title('Synthetic Data with Different Classes: ' + str(i))
plt.show()
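
Since each panel projects four features onto two, classes can look more mixed than they are. A quick complementary check, sketched below, is to count the samples per class. With the default `weights=None` the classes are generated roughly balanced, although the default `flip_y=0.01` can randomly reassign a few labels.

```python
import numpy as np
from sklearn.datasets import make_classification

class_counts = {}

for n_classes in range(2, 6):
    # Same settings as the plotting cell; only the labels are kept
    _, y_clf = make_classification(n_samples=100,
                                   n_features=4,
                                   n_classes=n_classes,
                                   n_redundant=0,
                                   n_informative=4,
                                   random_state=123)
    # Number of samples carrying each class label 0 .. n_classes - 1
    class_counts[n_classes] = np.bincount(y_clf, minlength=n_classes)
    print(n_classes, class_counts[n_classes])
```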

## via unsupervised learning

In [None]:
from sklearn.datasets import make_blobs

In [None]:
X, y = make_blobs(n_samples=100, centers=2,
                  n_features=2, random_state=0)

In [None]:
plt.figure(figsize=(18, 18))
k = 0
for i in range(2, 6):
    X, y = make_blobs(n_samples=100, centers=i,
                      n_features=2, random_state=0)
    k += 1
    plt.subplot(2, 2, k)
    plt.scatter(X[:, 0], X[:, 1], alpha=0.3, cmap='gray', c=y)
    plt.title('Synthetic Data with Different Clusters: ' + str(i))
plt.show()
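
Blobs are typically used to test clustering algorithms, because the generating labels are known. As a rough sketch of that use, the snippet below runs k-means on each blob dataset and compares the recovered clusters with the true labels through the adjusted Rand index (1.0 means perfect agreement). `blob_recovery_score` is a hypothetical helper, and the choice of k-means is an assumption; any clustering algorithm could take its place.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


def blob_recovery_score(n_centers, cluster_std=1.0):
    """Hypothetical helper: how well k-means recovers the generating blobs."""
    X_blob, y_blob = make_blobs(n_samples=100, centers=n_centers,
                                n_features=2, cluster_std=cluster_std,
                                random_state=0)
    kmeans = KMeans(n_clusters=n_centers, n_init=10, random_state=0)
    predicted = kmeans.fit_predict(X_blob)
    # The adjusted Rand index ignores how the cluster labels are numbered
    return adjusted_rand_score(y_blob, predicted)


ari_scores = {n: blob_recovery_score(n) for n in range(2, 6)}
for n, score in ari_scores.items():
    print(f'centers={n}: ARI={score:.3f}')
```

Lowering `cluster_std` makes the blobs tighter and easier to separate, so the score should move toward 1.0.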

# References
<hr style = "border:2px solid black" ></hr>

<div class="alert alert-warning">
<font color=black>

- https://github.com/abdullahkarasan/mlfrm/blob/main/codes/chp_10.ipynb
- Abdullah Karasan, *Machine Learning for Financial Risk Management with Python*

</font>
</div>