<a href="https://colab.research.google.com/github/honestycitra/dl_hw_test/blob/main/PassGAN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **PassGAN**

Source: https://github.com/DSC-UI-SRIN/Introduction-to-GAN/blob/master/4%20-%20Applications%20of%20GANs/PassGAN.ipynb

# **Introduction**

## **Abstract**

State-of-the-art password guessing tools, such as HashCat and John the Ripper, enable users to check billions of passwords per second against password hashes. In addition to performing straightforward dictionary attacks, these tools can expand password dictionaries using password generation rules, such as concatenation of words (e.g., "password123456") and leet speak (e.g., "password" becomes "p4s5w0rd"). Although these rules work well in practice, expanding them to model further passwords is a laborious task that requires specialized expertise. To address this issue, in this paper we introduce PassGAN, a novel approach that replaces human-generated password rules with theory-grounded machine learning algorithms. Instead of relying on manual password analysis, PassGAN uses a Generative Adversarial Network (GAN) to autonomously learn the distribution of real passwords from actual password leaks, and to generate high-quality password guesses. Our experiments show that this approach is very promising. When we evaluated PassGAN on two large password datasets, we were able to surpass rule-based and state-of-the-art machine learning password guessing tools. However, in contrast with the other tools, PassGAN achieved this result without any a priori knowledge of passwords or common password structures. Additionally, when we combined the output of PassGAN with the output of HashCat, we were able to match 51%-73% more passwords than with HashCat alone. This is remarkable, because it shows that PassGAN can autonomously extract a considerable number of password properties that current state-of-the-art rules do not encode.
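
To make the two rule types mentioned in the abstract concrete, here is a minimal sketch of hand-written dictionary expansion. The substitution table `LEET_MAP` and the helper names are hypothetical and are not taken from HashCat or John the Ripper; real rule engines are far richer. With this particular table, "password" becomes "p455w0rd".

```python
# Illustrative only: two hand-written mangling rules of the kind the abstract describes.
# LEET_MAP is a hypothetical substitution table, not one used by any real tool.
LEET_MAP = str.maketrans({"a": "4", "s": "5", "o": "0"})


def concatenate(word, suffixes):
    """Concatenation rule: append each suffix to the base word."""
    return [word + suffix for suffix in suffixes]


def leet(word):
    """Leet-speak rule: replace selected letters with look-alike digits."""
    return word.translate(LEET_MAP)


def expand(words, suffixes):
    """Apply both rules to every word, keeping first-seen order without duplicates."""
    candidates = []
    for word in words:
        for candidate in [word, leet(word)] + concatenate(word, suffixes):
            if candidate not in candidates:
                candidates.append(candidate)
    return candidates


guesses = expand(["password"], ["123456"])
print(guesses)
```

Every new pattern needs another hand-written rule like these, which is the manual effort PassGAN aims to replace by learning from leaked passwords directly.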

# **Prerequisites**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Import all prerequisites
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.autograd as autograd
from torchvision import datasets, transforms
from torch.autograd import Variable
from torchvision.utils import save_image
import numpy as np
import os
Tensor = torch.cuda.FloatTensor if torch.cuda.is_available() else torch.FloatTensor

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
ROOT = "password/"

# Create the data directory if it does not exist
os.makedirs(ROOT, exist_ok=True)

# Download the helper modules (datasets.py and utils.py)
!curl --remote-name \
     -H 'Accept: application/vnd.github.v3.raw' \
     --location https://raw.githubusercontent.com/DSC-UI-SRIN/Introduction-to-GAN/master/4%20-%20Applications%20of%20GANs/password/datasets.py

!curl --remote-name \
     -H 'Accept: application/vnd.github.v3.raw' \
     --location https://raw.githubusercontent.com/DSC-UI-SRIN/Introduction-to-GAN/master/4%20-%20Applications%20of%20GANs/password/utils.py



# **Dataset**

In [None]:
import datasets
batch_size = 100

# Rockyou Dataset

train_dataset = datasets.Rockyou(root=ROOT, train=True, download=True, input_size=(10,0), tokenize=False)

# Data Loader (Input Pipeline)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)

examples = enumerate(train_loader)
batch_idx, (example_data, example_targets) = next(examples)

Processing for the the first time...
Pre-process data..


21315685it [00:36, 584442.41it/s]


Make Charmap..


100%|██████████| 204/204 [00:00<00:00, 776441.03it/s]


Filter data and convert to discrete


100%|██████████| 21315685/21315685 [01:23<00:00, 256129.54it/s]


loaded 21315685 lines in dataset


In [None]:
print(example_data.shape)
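
The discriminator defined later expects input of shape `(batch, seq_len, n_chars)`, which suggests that each password is stored as a sequence of one-hot character vectors. The sketch below shows one way such an encoding could be built in plain Python. It is an assumption about what `datasets.Rockyou` does internally, not a copy of it: the helper names are hypothetical, and the backtick padding symbol is inferred from the decoding step later in this notebook.

```python
PAD = "`"  # assumed padding symbol; the notebook strips backticks when decoding samples


def build_charmap(passwords):
    """Map every character seen in the corpus (plus the pad symbol) to an index."""
    chars = sorted({ch for pw in passwords for ch in pw} | {PAD})
    return {ch: idx for idx, ch in enumerate(chars)}


def one_hot(password, charmap, seq_len=10):
    """Truncate or pad to seq_len, then emit one one-hot row per character."""
    padded = password[:seq_len].ljust(seq_len, PAD)
    rows = []
    for ch in padded:
        row = [0.0] * len(charmap)
        row[charmap[ch]] = 1.0
        rows.append(row)
    return rows


charmap = build_charmap(["abc123", "letmein"])
encoded = one_hot("abc123", charmap)
print(len(encoded), len(encoded[0]))  # seq_len rows, one column per character
```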

# **Building Model**

In [None]:
from torch import nn

class ResBlock(nn.Module):
    def __init__(self, dim, kernel_size=5):
        super(ResBlock, self).__init__()
        self.model = nn.Sequential(
            nn.ReLU(),
            nn.Conv1d(dim, dim, padding=kernel_size//2, kernel_size=kernel_size),
            nn.ReLU(),
            nn.Conv1d(dim, dim, padding=kernel_size//2, kernel_size=kernel_size)
        )

    def forward(self, input_data):
        output = (self.model(input_data))
        return input_data + output
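
The residual addition `input_data + output` only works if both convolutions preserve the sequence length. The small helper below (a hypothetical name, written for illustration) applies the output-length formula from the PyTorch `Conv1d` documentation to show why `padding=kernel_size//2` achieves that for odd kernel sizes.

```python
def conv1d_out_len(length, kernel_size, padding, stride=1, dilation=1):
    """Output length of a 1-D convolution, following the formula in the PyTorch Conv1d docs."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


seq_len = 10
kernel_size = 5

# With padding = kernel_size // 2 the length is unchanged, so the skip connection lines up.
same_len = conv1d_out_len(seq_len, kernel_size, padding=kernel_size // 2)

# Without padding the sequence shrinks by kernel_size - 1 positions.
shrunk_len = conv1d_out_len(seq_len, kernel_size, padding=0)

print(same_len, shrunk_len)
```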

## **Generator Model**

In [None]:
class Generator(nn.Module):
    def __init__(self, seq_len, layer_dim, z_dim, char_len):
        super(Generator, self).__init__()
        self.seq_len = seq_len
        self.layer_dim = layer_dim
        self.z_dim = z_dim
        self.char_len = char_len

        self.linear = nn.Linear(self.z_dim, self.seq_len*self.layer_dim)

        self.res_blocks = nn.Sequential(
            ResBlock(self.layer_dim),
            ResBlock(self.layer_dim),
            ResBlock(self.layer_dim),
            ResBlock(self.layer_dim),
            ResBlock(self.layer_dim),
        )
        self.conv = nn.Conv1d(self.layer_dim, self.char_len, kernel_size=1)

    def softmax(self, logits, num_classes):
        logits = logits.reshape(-1, num_classes)
        logits = logits.softmax(1)
        return logits.reshape(-1, self.seq_len, self.char_len)

    def forward(self, z_input):
        output = self.linear(z_input)
        output = output.view(-1, self.layer_dim, self.seq_len)
        output = self.res_blocks(output)
        output = self.conv(output)
        output = output.permute([0, 2, 1])
        output = self.softmax(output, self.char_len)
        return output
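
The generator's `softmax` method normalizes the logits at each sequence position independently, so every position carries its own probability distribution over the character set. A dependency-free sketch of that per-position normalization, using made-up logits for a three-character alphabet:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a single list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two sequence positions, three characters each (values are arbitrary).
position_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
probs = [softmax(row) for row in position_logits]

for row in probs:
    print([round(p, 3) for p in row], "sum =", round(sum(row), 3))
```

Each row sums to 1, which makes the generator output directly comparable to the one-hot rows of real passwords.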

## **Discriminator Model**

In [None]:
class Discriminator(nn.Module):
    def __init__(self, seq_len, layer_dim, char_len):
        super(Discriminator, self).__init__()
        self.seq_len = seq_len
        self.layer_dim = layer_dim
        self.char_len = char_len

        self.conv = nn.Conv1d(self.char_len, self.layer_dim, kernel_size=1)

        self.res_blocks = nn.Sequential(
            ResBlock(self.layer_dim),
            ResBlock(self.layer_dim),
            ResBlock(self.layer_dim),
            ResBlock(self.layer_dim),
            ResBlock(self.layer_dim),
        )
        self.linear = nn.Linear(self.seq_len*self.layer_dim, 1)

    def forward(self, input_data):
        output = input_data.permute([0, 2, 1])
        output = self.conv(output)
        output = self.res_blocks(output)
        output = output.view(-1, self.layer_dim*self.seq_len)
        output = self.linear(output)
        return output

# **Building Network**

In [None]:
# build network
z_dim = 128
seq_len = 10
layer_dim = 128


G = Generator(seq_len, layer_dim, z_dim, len(train_dataset.class_to_idx)).to(device)
D = Discriminator(seq_len, layer_dim, len(train_dataset.class_to_idx)).to(device)

In [None]:
print(G, D)

# **Train Process**

## **Gradient Penalty**

In [None]:
def compute_gradient_penalty(D, real_data, fake_data):
    # Random weight term for interpolation between real and fake samples
    alpha = Tensor(
        np.random.random((real_data.size(0), 1, 1)))

    # Get random interpolation between real and fake samples
    interpolates = alpha * real_data + ((1 - alpha) * fake_data)
    d_interpolates = D(interpolates.requires_grad_(True))
    # Tensor of ones passed as grad_outputs so autograd returns per-sample input gradients
    fake = Tensor(real_data.shape[0], 1).fill_(1.0)

    # Get gradient w.r.t. interpolates
    grads = autograd.grad(
        outputs=d_interpolates,
        inputs=interpolates,
        grad_outputs=fake,
        create_graph=True,
        retain_graph=True,
        only_inputs=True,
    )[0]

    grads = grads.reshape(grads.size(0), -1)
    grad_penalty = ((grads.norm(2, dim=1) - 1) ** 2).mean()
    return grad_penalty
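
For reference, the quantities computed by `compute_gradient_penalty` and by the training loop in this notebook can be written as follows. This is a reading of the code here, with $x$ a real batch, $\tilde{x} = G(z)$ a generated batch, and $\lambda$ standing for `lambda_gp`; the notation is chosen for illustration.

```latex
\hat{x} = \alpha x + (1 - \alpha)\,\tilde{x}, \qquad \alpha \sim U[0, 1]

L_D = -\,\mathbb{E}_{x}\big[D(x)\big]
      + \mathbb{E}_{\tilde{x}}\big[D(\tilde{x})\big]
      + \lambda\,\mathbb{E}_{\hat{x}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big]

L_G = -\,\mathbb{E}_{z}\big[D(G(z))\big]
```

The penalty term pushes the gradient norm of the critic toward 1 at points interpolated between real and generated samples.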

In [None]:
# Loss weight for gradient penalty
lambda_gp = 10

# optimizer
lr = 1e-4
n_critic =  5
b1 = 0.5
b2 = 0.999

optimizer_G = torch.optim.Adam(G.parameters(), lr=lr, betas=(b1, b2))
optimizer_D = torch.optim.Adam(D.parameters(), lr=lr, betas=(b1, b2))

In [None]:
from torch.utils.tensorboard import SummaryWriter

logdir = './runs'
os.makedirs(logdir, exist_ok=True)

writer = SummaryWriter(logdir)

In [None]:
%load_ext tensorboard
%tensorboard --logdir runs/

In [None]:
def check_generated_data(samples, iters, tag="result"):
    """
    Decode generator output for a batch of latent vectors and log it to TensorBoard.
    :param samples: latent noise tensor of shape (n_samples, z_dim) fed to the generator
    :param iters: global iteration count used as the TensorBoard step
    :param tag: TensorBoard tag under which the decoded passwords are logged
    :return: None
    """

    G.eval()
    with torch.no_grad():
        inv_charmap = train_dataset.idx_to_class

        samples = G(samples)

        # .cpu() is a no-op when the tensor already lives on the CPU
        samples = samples.cpu().numpy()

        # Pick the most likely character at each position
        samples = np.argmax(samples, axis=2)

        decoded_samples = []
        for i in range(len(samples)):
            decoded = []
            for j in range(len(samples[i])):
                decoded.append(inv_charmap[samples[i][j]])
            # Drop backticks, which appear to serve as the padding symbol
            decoded_samples.append("".join(decoded).replace('`', ""))
        # print(", ".join(decoded_samples))
        writer.add_text(tag, ", ".join(decoded_samples), iters)
    # Switch back to training mode before the training loop continues
    G.train()
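
The decoding step can be tried in isolation on a toy example. The sketch below uses plain Python lists in place of the NumPy array and a made-up three-symbol `inv_charmap`; the `decode` helper is hypothetical and only mirrors the argmax-then-strip logic used in this notebook.

```python
def decode(prob_rows, inv_charmap, pad="`"):
    """Take the most likely character at each position, then strip the pad symbol."""
    indices = [max(range(len(row)), key=row.__getitem__) for row in prob_rows]
    return "".join(inv_charmap[idx] for idx in indices).replace(pad, "")


inv_charmap = ["`", "a", "b"]  # toy index-to-character table
sample = [
    [0.1, 0.7, 0.2],  # most likely: "a"
    [0.2, 0.1, 0.7],  # most likely: "b"
    [0.8, 0.1, 0.1],  # most likely: pad symbol
]
print(decode(sample, inv_charmap))
```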

In [None]:
epochs = 200
list_loss_D = []
list_loss_G = []
fixed_z = Variable(Tensor(np.random.normal(0, 1, (10, z_dim))))
for epoch in range(epochs):
    for i, (X, _) in enumerate(train_loader):
        # Configure input
        real_data = Variable(X.type(Tensor))

        # ---------------------
        #  Train Discriminator
        # ---------------------

        optimizer_D.zero_grad()

        # Sample noise as generator input
        z = Variable(Tensor(np.random.normal(0, 1, (real_data.shape[0], z_dim))))

        # Generate a batch of fake passwords
        fake_data = G(z).detach()

        # Gradient penalty
        gradient_penalty = compute_gradient_penalty(D, real_data.data, fake_data.data)

        # Adversarial loss
        d_loss = -torch.mean(D(real_data)) + torch.mean(D(fake_data)) + lambda_gp * gradient_penalty

        d_loss.backward()
        optimizer_D.step()

        # Train the generator every n_critic iterations
        if i % n_critic == 0:

            # -----------------
            #  Train Generator
            # -----------------

            optimizer_G.zero_grad()

            # Generate a batch of fake passwords
            gen_data = G(z)
            # Adversarial loss
            g_loss = -torch.mean(D(gen_data))

            g_loss.backward()
            optimizer_G.step()

            list_loss_D.append(d_loss.item())
            list_loss_G.append(g_loss.item())
        
        if i % 300 == 0:
            print(
              "[Epoch %d/%d] [Batch %d/%d] [D loss: %f] [G loss: %f]"
              % (epoch, epochs, i, len(train_loader), d_loss.item(), g_loss.item()))
            writer.add_scalar('G_loss', g_loss.item(), epoch * len(train_loader) + i)
            writer.add_scalar('D_loss', d_loss.item(), epoch * len(train_loader) + i)

    if epoch % 5 == 0:
        check_generated_data(fixed_z, tag="result_{}".format(epoch), iters=epoch * len(train_loader) + i)
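
The loop above updates the discriminator on every batch and the generator only when `i % n_critic == 0`. A small stand-alone sketch of that schedule, using a hypothetical helper and an arbitrary batch count, makes the ratio explicit:

```python
def update_schedule(num_batches, n_critic=5):
    """Return the batch indices at which the critic and the generator are updated."""
    critic_steps = list(range(num_batches))
    generator_steps = [i for i in range(num_batches) if i % n_critic == 0]
    return critic_steps, generator_steps


critic, generator = update_schedule(12)
print("critic updates:", len(critic))
print("generator updates at batches:", generator)
```

Because batch 0 always triggers a generator step, `g_loss` is defined by the time the first log line is printed.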