
# Assessing password security using ML

*Password cracking is the systematic endeavor of discovering the password of a secure system. Cracking can involve using common passwords, cleverly generated candidate passwords (for example, replacing the letter O with the number 0 or writing a word backward), or just using a plain brute-force exhaustive search. To make it more difficult to crack a password, a strong password must be chosen.*

Source: Emmanuel Tsukerman, Machine Learning for Cybersecurity Cookbook, chapter 7

In [5]:
# https://github.com/nunoaflopes/IA4cyber-Livro1-ML-for-Cyber-Cookbook-Packt/raw/master/Chapter07/Assessing%20Password%20Security%20Using%20Machine%20Learning/passwordDataset.7z

# file: passwordDataset.7z
# size: 4633682

!wget https://github.com/nunoaflopes/IA4cyber-Livro1-ML-for-Cyber-Cookbook-Packt/raw/master/Chapter07/Assessing%20Password%20Security%20Using%20Machine%20Learning/passwordDataset.7z


--2023-06-12 13:51:24--  https://github.com/nunoaflopes/IA4cyber-Livro1-ML-for-Cyber-Cookbook-Packt/raw/master/Chapter07/Assessing%20Password%20Security%20Using%20Machine%20Learning/passwordDataset.7z
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/nunoaflopes/IA4cyber-Livro1-ML-for-Cyber-Cookbook-Packt/master/Chapter07/Assessing%20Password%20Security%20Using%20Machine%20Learning/passwordDataset.7z [following]
--2023-06-12 13:51:25--  https://raw.githubusercontent.com/nunoaflopes/IA4cyber-Livro1-ML-for-Cyber-Cookbook-Packt/master/Chapter07/Assessing%20Password%20Security%20Using%20Machine%20Learning/passwordDataset.7z
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:4

In [6]:
# Extract the 7z archive, then list the working directory to confirm the CSV is there

!p7zip -d passwordDataset.7z
!ls -la



7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (406F0),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan         1 file, 4633682 bytes (4526 KiB)

Extracting archive: passwordDataset.7z
--
Path = passwordDataset.7z
Type = 7z
Physical Size = 4633682
Headers Size = 146
Method = LZMA2:12m
Solid = -
Blocks = 1

Everything is Ok

Size:       9368084
Compressed: 4633682
total 9168
drwxr-xr-x 1 root root    4096 Jun 12 13:51 .
drwxr-xr-x 1 root root    4096 Jun 12 13:44 ..
drwxr-xr-x 4 root root    4096 Jun  8 18:17 .config
-rw-r--r-- 1 root root 9368084 May  8  2019 passwordDataset.csv
drwxr-xr-x 1 root root    4096 

In [7]:
# Load the dataset into a pandas DataFrame, reading passwords as strings and strength labels as integers.
import pandas as pd

df = pd.read_csv(
    "passwordDataset.csv", dtype={"password": "str", "strength": "int"}, index_col=None
)

In [8]:
df

Unnamed: 0,password,strength
0,kzde5577,1
1,kino3434,1
2,visi7k1yr,1
3,megzy123,1
4,lamborghin1,1
...,...,...
669634,10redtux10,1
669635,infrared1,1
669636,184520socram,1
669637,marken22a,1


In [21]:
# Shuffle all rows so that the head/tail split in the next cell is random.
df = df.sample(frac=1)

In [22]:
# Use the first 80% of the shuffled rows for training and the last 20% for testing.

n_rows = len(df.index)
train_df = df.head(int(n_rows * 0.8))
test_df = df.tail(int(n_rows * 0.2))
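
The same 80/20 split can also be written with scikit-learn's `train_test_split`, which shuffles and splits in one call. The following is a minimal sketch on a toy DataFrame, not the notebook's actual split: the rows, the labels, the `toy_*` names, and the `random_state` value are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the password dataset; rows and labels are invented.
toy_df = pd.DataFrame({
    "password": ["qwerty", "111111", "dragon", "letmein", "abc123",
                 "kzde5577", "kino3434", "megzy123", "marken22a", "infrared1"],
    "strength": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle repeatable.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, shuffle=True, random_state=0
)

print(len(toy_train), len(toy_test))  # 8 2
```

Fixing `random_state` is a design choice: it makes the reported scores reproducible between runs, which the plain `df.sample(frac=1)` shuffle does not.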

In [11]:
y_train = train_df.pop("strength").values
y_test = test_df.pop("strength").values

In [12]:
X_train = train_df.values.flatten()
X_test = test_df.values.flatten()

In [13]:
def character_tokens(input_string):
    """Break string into characters."""
    return list(input_string)

In [23]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
# Character-level TF-IDF features feed a decision tree, which reaches about 93% accuracy on the held-out test set.

password_clf = Pipeline([
    ("vect", TfidfVectorizer(tokenizer=character_tokens)),
    ("clf", DecisionTreeClassifier()),
])
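
To make the feature extraction step more concrete, here is a small sketch of what the character-level `TfidfVectorizer` produces on three short passwords. It is an illustration only: the `sample_passwords` list is chosen for the example, and `token_pattern=None` is an addition that silences the warning scikit-learn emits when a custom tokenizer is supplied.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def character_tokens(input_string):
    """Break string into characters."""
    return list(input_string)


# token_pattern=None is an assumption added here to avoid the "token_pattern
# will not be used" warning; it does not change the resulting features.
vectorizer = TfidfVectorizer(tokenizer=character_tokens, token_pattern=None)

sample_passwords = ["kzde5577", "kino3434", "qwerty"]
features = vectorizer.fit_transform(sample_passwords)

# Each distinct character becomes one column of the feature matrix.
print(sorted(vectorizer.vocabulary_))
print(features.shape)  # (3, 16)
```

One consequence worth knowing: with its default `lowercase=True`, `TfidfVectorizer` lowercases each string before the tokenizer runs, so `"QWERTY"` and `"qwerty"` map to the same feature vector. Passing `lowercase=False` would keep letter case as a signal, though that would change the scores reported in this notebook.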


In [15]:
password_clf.fit(X_train, y_train)



In [16]:
password_clf.score(X_train, y_train)

0.9996901314328062

In [17]:
password_clf.score(X_test, y_test)

0.9274007481687785
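
The gap between the training score (about 0.9997) and the test score (about 0.927) suggests the tree fits the training data more closely than it generalizes. Accuracy alone does not show which strength classes are being confused, so a confusion matrix can be a useful follow-up. The sketch below shows the idea on invented toy data; the passwords' labels, the `toy_*` names, and `random_state=0` are assumptions, and for brevity the matrix is computed on the toy training data itself. In the notebook, the equivalent call would use `password_clf.predict(X_test)` and `y_test`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def character_tokens(input_string):
    """Break string into characters."""
    return list(input_string)


# Hypothetical toy data; the strength labels are invented for illustration.
toy_passwords = ["qwerty", "111111", "dragon", "letmein",
                 "kzde5577", "kino3434", "megzy123", "marken22a"]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_clf = Pipeline([
    ("vect", TfidfVectorizer(tokenizer=character_tokens, token_pattern=None)),
    ("clf", DecisionTreeClassifier(random_state=0)),
])
toy_clf.fit(toy_passwords, toy_labels)

# Rows are true labels, columns are predicted labels.
toy_predictions = toy_clf.predict(toy_passwords)
toy_confusion = confusion_matrix(toy_labels, toy_predictions, labels=[0, 1])
print(toy_confusion)
```

Off-diagonal entries count misclassified passwords, which shows whether errors come mainly from weak passwords rated too high or strong ones rated too low.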

In [18]:
common_password = "qwerty"
strong_computer_generated_password = "c9lCwLBFmdLbG6iWla4H"

In [19]:
password_clf.predict([common_password, strong_computer_generated_password])

array([0, 1])