## Visual insights into Passwords: Part 2

### Objective
In part 1, we took a look at the dataset and produced a few visualizations. Next, we want to conduct topic modelling from a machine learning point of view. More specifically, we will use a [fastText embedding](https://en.wikipedia.org/wiki/FastText) to represent passwords as 300-dimensional vectors. To visualize them in 2 or 3 dimensions, we will use dimension reduction techniques.

In [1]:
# import the necessary libraries
import pandas as pd
import numpy as np
import gensim
from gensim.models import fasttext
import matplotlib
import matplotlib.pyplot as plt
import PyQt5
import umap
%matplotlib qt

In [2]:
df = pd.read_csv('../data/data2use/USA2/data0.csv', index_col=0)
df.head(10)

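| Unnamed: 0 | password | frequency | distance_score | passlength | unique_c | first_char | last_char | number_of_uppercase | number_of_digits | number_of_symbols | number_of_lowercase | category | zxcvbn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 123456 | 67329 | 1.0 | 6 | 6 | digit | digit | 0 | 6 | 0 | 0 | numeric | 0 |
| 1 | 123456789 | 25745 | 1.0 | 9 | 9 | digit | digit | 0 | 9 | 0 | 0 | numeric | 0 |
| 2 | qwerty | 25539 | 1.0 | 6 | 6 | lower | lower | 0 | 0 | 0 | 6 | alphabetic | 0 |
| 3 | password | 11259 | 3.5 | 8 | 7 | lower | lower | 0 | 0 | 0 | 8 | alphabetic | 0 |
| 4 | 12345 | 9922 | 1.0 | 5 | 5 | digit | digit | 0 | 5 | 0 | 0 | numeric | 0 |
| 5 | b123456 | 9150 | 1.54 | 7 | 7 | lower | digit | 0 | 6 | 0 | 1 | numeric | 1 |
| 6 | 123456b | 9143 | 1.43 | 7 | 7 | digit | lower | 0 | 6 | 0 | 1 | numeric | 1 |
| 7 | 123456c | 8251 | 1.67 | 7 | 7 | digit | lower | 0 | 6 | 0 | 1 | numeric | 1 |
| 8 | c123456 | 8244 | 1.36 | 7 | 7 | lower | digit | 0 | 6 | 0 | 1 | numeric | 1 |
| 9 | 12345678 | 8088 | 1.0 | 8 | 8 | digit | digit | 0 | 8 | 0 | 0 | numeric | 0 |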


For our current purpose, we only need the password column of the dataset.

### Password embedding with fastText
Word embedding is a technique used in Natural Language Processing (NLP) to represent words as points in a high-dimensional space, so that each word has an associated numerical vector. The most basic desired property of a word embedding is that words with similar meanings lie close to each other.
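To make "close to each other" concrete, here is a minimal sketch that compares vectors by cosine similarity, one common closeness measure for embeddings. The three-dimensional vectors and the helper name `cosine_similarity` are made up for illustration; they are not taken from the fastText model used in this notebook.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 3-dimensional vectors, for illustration only (not real embeddings)
toy = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

sim_cat_dog = cosine_similarity(toy["cat"], toy["dog"])
sim_cat_car = cosine_similarity(toy["cat"], toy["car"])

# In this toy setup, "cat" is closer to "dog" than to "car"
print(sim_cat_dog > sim_cat_car)
```

With a real embedding the same comparison would be applied to the 300-dimensional vectors instead of these toy ones.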

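fastText is a library for word embedding created by Facebook's AI Research lab. A word embedding is built upon a vocabulary, but fastText also learns vectors for character n-grams, so it can produce a vector for a string outside that vocabulary, such as a password, by combining the vectors of its n-grams.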
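One way to picture how fastText can handle a string it has never seen, such as a password, is to list the character n-grams the string is broken into. The sketch below is a simplified illustration, not the library's implementation: the function name `char_ngrams` is hypothetical, the real library additionally hashes n-grams into buckets, and the n-gram range used by the model loaded in this notebook is not stated here (3 to 6 is fastText's default for unsupervised training).

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# Restrict to trigrams to keep the output short
grams = char_ngrams("hello123", min_n=3, max_n=3)
print(grams)
```

Because a password like `hello123` shares n-grams such as `hel` and `llo` with ordinary words, its vector can be assembled even when the full string was never seen during training.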

In [4]:
# loading fasttext model
model = fasttext.load_facebook_vectors('../data/fasttext_models/wiki.en.bin')

In [10]:
# example of a password embedding
model['hello123']

array([-0.12791799, -0.06494759, -0.04391411, -0.09344378, -0.20545131,
       -0.07863272,  0.16797353, -0.17578395, -0.05386898, -0.15918167,
       -0.18677713,  0.05414825,  0.07892729, -0.26773155,  0.240542  ,
       -0.26164737,  0.0308893 ,  0.01671401,  0.08233713, -0.14278439,
       -0.00716576,  0.3266556 ,  0.07943831,  0.07748769, -0.15551227,
       -0.1852229 , -0.15285002, -0.12167276, -0.05522198,  0.2579628 ,
       -0.3626395 ,  0.15684427, -0.20550492,  0.13036738, -0.16333723,
       -0.04185367, -0.24040319, -0.05731153, -0.11681544, -0.01761103,
        0.17473377, -0.09494759,  0.06681579,  0.08521983, -0.02771697,
        0.19598943, -0.07935197,  0.03730717, -0.38345242,  0.01025466,
       -0.09711715, -0.4918841 ,  0.16596034,  0.02118551,  0.04988703,
        0.08059371, -0.00439768, -0.28616515,  0.14371178,  0.23619308,
        0.10012786,  0.26847646, -0.09483156, -0.26119354,  0.04041209,
        0.03156268,  0.14464356,  0.27149758, -0.16569377,  0.05

Note that this is a 300-dimensional vector.

In [None]:
# Create an empty dataframe to store the embedding vectors
Emb = pd.DataFrame(columns=range(300))
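The notebook does not show how this dataframe is filled, so the following is only a sketch of one possible approach: stack one vector per password into a matrix and build the dataframe in a single step, which avoids appending rows one at a time. The function `toy_embed` is a hypothetical stand-in for `model[password]` so that the example runs without the fastText model file, and `emb_sketch` is an illustrative name.

```python
import numpy as np
import pandas as pd

DIM = 300  # dimensionality of the embedding vectors

def toy_embed(password, dim=DIM):
    """Hypothetical stand-in for model[password]: a deterministic pseudo-random vector."""
    seed = sum(ord(c) for c in password)
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim).astype(np.float32)

passwords = ["123456", "qwerty", "password"]

# One row per password, one column per embedding dimension
vectors = np.vstack([toy_embed(p) for p in passwords])
emb_sketch = pd.DataFrame(vectors, index=passwords, columns=range(DIM))

print(emb_sketch.shape)
```

With the real model, `toy_embed(p)` would presumably be replaced by `model[p]`, with the passwords taken from the `password` column of `df`.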