1.4 Billion Text Credentials Analysis (NLP)

Using deep learning and NLP to analyze a large corpus of cleartext passwords.

Objectives:

Train a generative model.
Understand how people change their passwords over time: hello123 -> h@llo123 -> h@llo!23.
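As a rough illustration of the second objective, the character-level changes between successive passwords can be listed with the standard library's `difflib`. This is only a sketch: the helper name `edit_ops` is hypothetical and is not part of this repository, and the approach shown (a plain sequence diff) is an assumption, not the model this project trains.

```python
from difflib import SequenceMatcher


def edit_ops(old: str, new: str):
    """Return the non-matching edit operations that turn `old` into `new`.

    Hypothetical helper for illustration; each item is
    (operation, removed_text, inserted_text).
    """
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# The chain from the objective above, compared pairwise.
chain = ["hello123", "h@llo123", "h@llo!23"]
for before, after in zip(chain, chain[1:]):
    print(before, "->", after, edit_ops(before, after))
```

Each step in this chain reduces to a single character substitution (`e` to `@`, then `1` to `!`), which is the kind of pattern the analysis aims to capture.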

Disclaimer: for research purposes only.

In the press

Get the data

Install any torrent client.
Here is a magnet link that was posted on Reddit:
- magnet:?xt=urn:btih:7ffbcd8cee06aba2ce6561688cf68ce2addca0a3&dn=BreachCompilation&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fglotorrents.pw%3A6969&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337

Deep Learning

Stay tuned!

Map the password list for each email

Generate the JSON files that map each email to its list of passwords. The output folder is ~/BreachCompilationAnalysis.

python3 read.py --breach_compilation_folder ~/BreachCompilation

Make sure you have enough free memory (8 GB should be enough).
It took 1h30m to run on an Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz (on a single thread).
Uncompressed output is 13 GB.

Output is of the form:

> less ReducePasswordsOnSimilarEmailsCallback-z-b.json # emails starting with zb.
{
    "zb-email1@yahoo.com": [
        "pass1",
        "pass2"
    ],
    "zb-email2@yahoo.com": [
        "pass1",
        "pass2",
        "pass3"
    ],
    [...]
}
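A file of this shape can be read back with the standard `json` module. The sketch below assumes each output file is one valid UTF-8 JSON object mapping an email to a list of passwords (the `[...]` above is an elision, not real content). The function names `load_mapping` and `passwords_per_email` are hypothetical and do not come from this repository.

```python
import json
from collections import Counter
from pathlib import Path


def load_mapping(path):
    """Load one output file: a JSON object mapping email -> list of passwords."""
    with Path(path).expanduser().open(encoding="utf-8") as f:
        return json.load(f)


def passwords_per_email(mapping):
    """Histogram: number of passwords -> how many emails have that many."""
    return Counter(len(passwords) for passwords in mapping.values())


# Stand-in data shaped like the sample above, so the sketch runs without the dataset.
sample = {
    "zb-email1@yahoo.com": ["pass1", "pass2"],
    "zb-email2@yahoo.com": ["pass1", "pass2", "pass3"],
}
print(passwords_per_email(sample))
```

With the real output, `load_mapping` would be pointed at one file inside ~/BreachCompilationAnalysis. Since the full output is large, loading one file at a time keeps memory use bounded.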

linexjlin/tensorflow-1.4-billion-password-analysis
