# Password Strength Classifier using NLP 
### Cryptography and Cyber Security IA1
---
**Group Members:**

| Name | Roll Number |
| :--- | :--- |
| Aakash Saroop | 1911001 |
| Bhairav Narkhede | 1911003 |
| Pathik Ghugare | 1911014 |

# Dataset info 
![password](https://images.ctfassets.net/q33z48p65a6w/7GmTIyrf7kNHSyje5E8mxB/aae3dc9e041425ead15b52ecb3e70bab/how-to-make-a-strong-password.png?w=1200&h=645&fit=thumb)

The passwords in this dataset come from the 000webhost leak, which is available online.
Each password was given a strength rating by PARS, a tool from Georgia Tech that integrates commercial password meters.

# Importing necessary libraries

In [22]:
import string
import tqdm
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from sklearn.model_selection import train_test_split

In [23]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
plt.rcParams.update({'font.size': 20})
warnings.filterwarnings("ignore")

# Peeking into our data

In [24]:
df = pd.read_csv("../input/password-strength-classifier-dataset/data.csv", on_bad_lines='skip')
# on_bad_lines='skip' drops the 3 malformed rows that otherwise raise a ParserError

In [25]:
df.head()

In [26]:
df.shape

# Data Cleaning

## Handling missing values

In [27]:
df.isna().sum()

**Only one value is missing, so let's drop it**

In [28]:
df.dropna(inplace=True)

In [29]:
df.isna().sum()

## Checking data duplication

In [30]:
df[df.duplicated()]

**No duplicate rows are present**

# Simple EDA 

## Which characters occur most frequently in passwords?

In [31]:
char_count = {}
for password in df["password"] :
    for letter in password :
        # count each character case-sensitively; unseen characters start from 0
        char_count[letter] = char_count.get(letter, 0) + 1

In [32]:
sorted_counts = {k: v for k, v in sorted(char_count.items(), reverse=True, key=lambda item: item[1])}
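
The counting loop and the sort above can also be written with `collections.Counter` from the standard library. The following is a minimal sketch on a small made-up list (`sample_passwords` is hypothetical and not taken from the dataset); it counts characters case-sensitively.

```python
from collections import Counter

# Hypothetical sample; in the notebook this would be df["password"].
sample_passwords = ["kzde5577", "abc123", "AAbb11"]

# Counter.update() on a string adds one count per character (case-sensitive).
char_counter = Counter()
for password in sample_passwords:
    char_counter.update(password)

# most_common(n) returns the n most frequent (character, count) pairs,
# ordered from the highest count down.
top_chars = char_counter.most_common(3)
print(top_chars)
```

`most_common(30)` would give the same 30 characters that the bar plot in the next cell slices out of `sorted_counts`.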

In [33]:
plt.figure(figsize=(20,10))
sns.barplot(x=list(sorted_counts.keys())[:30], y=list(sorted_counts.values())[:30]);

**Characters such as a, A and 1 occur much more frequently than the others**

In [34]:
sns.displot(x="strength", data=df);

**The data is quite unbalanced: more passwords have strength 1 than strength 0 or strength 2**
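
To put numbers on the imbalance, one option is to compute each class's share of the data together with the "balanced" weight heuristic `n_samples / (n_classes * count)`, which is the formula scikit-learn documents for `class_weight="balanced"`. The sketch below uses hypothetical labels (`toy_labels`), not the real `strength` column, and we have not checked whether such weights would help the models in this notebook.

```python
from collections import Counter


def class_summary(labels):
    """Return {label: (share, balanced_weight)} for a list of class labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    summary = {}
    for label, count in sorted(counts.items()):
        share = count / n_samples
        # Balanced heuristic: rarer classes receive larger weights.
        weight = n_samples / (n_classes * count)
        summary[label] = (share, weight)
    return summary


# Hypothetical labels; in the notebook this would be df["strength"].tolist().
toy_labels = [0, 1, 1, 1, 1, 1, 1, 2]
summary = class_summary(toy_labels)
for label, (share, weight) in summary.items():
    print(f"strength={label}: share={share:.2f}, weight={weight:.2f}")
```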

## Effect of password length on strength

In [35]:
df['length'] = df["password"].str.len()

In [36]:
df.groupby('length').agg({'strength':'mean'}).plot(kind='bar', figsize=(20,10));

**Longer passwords tend to be stronger than shorter ones**

# Model 1

Aakash

In [37]:
password_tuple = np.array(df[['password', 'strength']])

In [38]:
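# random.shuffle swaps rows through views and can duplicate rows of a 2-D NumPy array;
# np.random.shuffle permutes the rows in place safely
np.random.shuffle(password_tuple)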

In [39]:
y = [labels[1] for labels in password_tuple ]

In [40]:
x = [labels[0] for labels in password_tuple ]

In [41]:
def word_char(inputs):
    # split a password into a list of its individual characters
    return list(inputs)

In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(tokenizer = word_char)
x =vect.fit_transform(x)
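
One detail worth knowing about this cell: `TfidfVectorizer` lowercases its input by default (`lowercase=True`), and that happens before a custom tokenizer is called, so `A` and `a` end up as the same feature. The sketch below shows the difference on a hypothetical toy list; `to_chars`, `folded` and `cased` are illustrative names, and `token_pattern=None` is passed only to silence the warning about the unused token pattern. Whether keeping case would improve the classifiers in this notebook was not measured.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def to_chars(text):
    # Character-level tokenizer, equivalent to word_char in the notebook.
    return list(text)


# Hypothetical passwords, not taken from the dataset.
toy_passwords = ["Abc123", "abc123", "ABC!@#"]

# Default behaviour: text is lowercased before the tokenizer runs.
folded = TfidfVectorizer(tokenizer=to_chars, token_pattern=None)
folded.fit(toy_passwords)

# lowercase=False keeps upper- and lowercase characters as separate features.
cased = TfidfVectorizer(tokenizer=to_chars, token_pattern=None, lowercase=False)
cased.fit(toy_passwords)

print(sorted(folded.vocabulary_))
print(sorted(cased.vocabulary_))
```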

In [43]:
x.shape

In [44]:
import xgboost as xgb
from sklearn.model_selection import train_test_split

In [45]:
X_train,X_test ,y_train,y_test = train_test_split(x,y,test_size = 0.20,random_state = 42)
xg = xgb.XGBClassifier(eval_metric='mlogloss')

In [46]:
xg.fit(X_train,y_train)

In [47]:
xg.score(X_test,y_test)

In [48]:
inp = ",8+3t)kGE#a.b)%("
inp =vect.transform([inp])

In [49]:
inp.shape

In [50]:
xg.predict(inp)

# Model 2

Bhairav

In [51]:
from sklearn.linear_model import LogisticRegression

In [52]:
clf=LogisticRegression(random_state=0, solver='sag', multi_class='multinomial')

In [53]:
clf.fit(X_train,y_train)

In [54]:
dt=np.array(['%@123abcd'])
pred=vect.transform(dt)
clf.predict(pred)

In [55]:
y_pred=clf.predict(X_test)
y_pred

In [56]:
from sklearn.metrics import confusion_matrix,accuracy_score

In [57]:
cm=confusion_matrix(y_test,y_pred)

In [58]:
sns.set(font_scale=1.4) # for label size
# annot_kws sets the annotation font size; fmt='d' prints the counts as integers
sns.heatmap(cm, cmap="Greens", annot=True, annot_kws={"size": 16}, fmt='d');

In [59]:
print(accuracy_score(y_test,y_pred))
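
Because the classes are unbalanced, overall accuracy can hide weak performance on the rarer strength levels. A simple check is the recall of each class, read off the confusion matrix (rows are true labels, columns are predictions, as returned by `confusion_matrix`). The sketch below uses hypothetical counts in `toy_cm`; they are not results from this notebook.

```python
def per_class_recall(cm):
    """Recall per class from a square confusion matrix (rows = true labels)."""
    recalls = []
    for i, row in enumerate(cm):
        total = sum(row)
        # Diagonal entry = correctly predicted samples of class i.
        recalls.append(row[i] / total if total else 0.0)
    return recalls


# Hypothetical counts for strengths 0, 1 and 2 (not real results).
toy_cm = [
    [50, 40, 10],
    [5, 690, 5],
    [10, 30, 60],
]

recalls = per_class_recall(toy_cm)
accuracy = sum(toy_cm[i][i] for i in range(len(toy_cm))) / sum(map(sum, toy_cm))
macro_recall = sum(recalls) / len(recalls)

print("per-class recall:", [round(r, 3) for r in recalls])
print("accuracy:", round(accuracy, 3), "macro recall:", round(macro_recall, 3))
```

In this toy case the accuracy is noticeably higher than the macro-averaged recall, because the majority class dominates the total. The same function should accept the `cm` array computed above, since it only iterates over rows and indexes into them.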

# Model 3

Pathik 

In [60]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [61]:
def word_to_chars(word):
    return list(word)

In [62]:
word_to_chars("kzde5577")

In [63]:
vectorizer = TfidfVectorizer(tokenizer=word_to_chars)
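
To make clear what the vectorizer computes, here is a minimal pure-Python sketch of character-level TF-IDF. It follows the smoothed formula that scikit-learn documents as its default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. Treat it as an illustration rather than a replica of the library: `char_tfidf` and `toy_passwords` are hypothetical names, and the sketch skips the lowercasing that `TfidfVectorizer` applies by default.

```python
import math
from collections import Counter


def char_tfidf(passwords):
    """Character-level TF-IDF with smoothed IDF and L2-normalised rows."""
    n_docs = len(passwords)

    # Document frequency: in how many passwords each character appears.
    doc_freq = Counter()
    for password in passwords:
        doc_freq.update(set(password))

    # Smoothed inverse document frequency.
    idf = {ch: math.log((1 + n_docs) / (1 + df)) + 1 for ch, df in doc_freq.items()}

    rows = []
    for password in passwords:
        # Term frequency is the raw character count within the password.
        weights = {ch: count * idf[ch] for ch, count in Counter(password).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({ch: w / norm for ch, w in weights.items()} if norm else {})
    return rows


# Hypothetical passwords, not taken from the dataset.
toy_passwords = ["kzde5577", "abc5", "zz5!"]
rows = char_tfidf(toy_passwords)
for password, row in zip(toy_passwords, rows):
    print(password, {ch: round(w, 3) for ch, w in row.items()})
```

Characters that appear in few passwords (such as `!` here) receive a larger weight than characters that appear in every password (such as `5`), which is the property the classifiers rely on.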

In [64]:
# select the columns by name rather than by position
X, y = df["password"].values, df["strength"].values

In [65]:
y=y.astype('int')

In [66]:
X = vectorizer.fit_transform(X)

In [67]:
X.shape

In [68]:
X_train, X_test, y_train, y_test \
 = train_test_split(X,y,test_size=0.2)

In [69]:
X_train.shape

In [70]:
from sklearn.ensemble import RandomForestClassifier

In [71]:
clf = RandomForestClassifier(n_estimators=10,criterion='entropy')

In [72]:
clf.fit(X_train,y_train)

In [73]:
y_pred=clf.predict(X_test)

In [74]:
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

In [75]:
sns.set(font_scale=1.4) # for label size
# annot_kws sets the annotation font size; fmt='d' prints the counts as integers
sns.heatmap(confusion_matrix(y_test,y_pred), cmap="Greens", annot=True, annot_kws={"size": 16}, fmt='d');

In [76]:
accuracy_score(y_test,y_pred)

In [77]:
print(classification_report(y_test,y_pred))

In [78]:
clf.predict(vectorizer.transform(["P@Thik!23am3$#"]))

**The XGBoost classifier gave the best results with the TF-IDF vectorizer, so we will use that pair for our web app**

In [79]:
import joblib

In [80]:
xgb_pipeline_objects = {
    'vectorizer' : vect,
    'model' : xg
}

In [81]:
joblib.dump(xgb_pipeline_objects, 'xgb_pipeline_objects.pkl')
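
On the web app side, the saved dictionary can be loaded back and used roughly as sketched below. `load_pipeline` and `predict_strength` are hypothetical helper names, not part of the notebook. One caveat: pickle stores functions by reference (module and name), not by value, so the process that loads the file needs a `word_char` function reachable under the same module path it had when the vectorizer was fitted (`__main__` when pickled from a notebook).

```python
def word_char(inputs):
    # Same tokenizer the vectorizer was fitted with; it must exist under the
    # same name when the pickle is loaded.
    return list(inputs)


def load_pipeline(path="xgb_pipeline_objects.pkl"):
    # joblib is only needed for loading, so it is imported here.
    import joblib

    return joblib.load(path)


def predict_strength(password, pipeline_objects):
    """Return the predicted strength class for a single password."""
    # Vectorize with the fitted TF-IDF vectorizer, then classify.
    features = pipeline_objects["vectorizer"].transform([password])
    return int(pipeline_objects["model"].predict(features)[0])


# Usage, assuming the .pkl file written above is in the working directory:
#   objects = load_pipeline()
#   predict_strength("P@Thik!23am3$#", objects)
```

Keeping the vectorizer and the model in one file, as done above, makes sure a new password is transformed with the same vocabulary the model was trained on.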

### References :
* https://www.kaggle.com/bhavikbb/password-strength-classifier-dataset
* https://www.kaggle.com/avi111297/pred-strength-of-a-p-w-logistic-regres-82-04-acc
* https://www.kaggle.com/alrafiaurnob/password-strength-classifier-rf-97-acc