## Chatbot Kiti
Here I plan to build an information-retrieval chatbot that searches the U.S. Citizenship and Immigration Services (USCIS) Policy Manual and responds to user queries. The USCIS Policy Manual contains the official policies of USCIS and supports immigration officers in making decisions. The source for this research is the nonimmigrant policy: 137 legal documents collected under USCIS Policy Manual Volume 2 (https://www.uscis.gov/policy-manual/volume-2). Users should be able to ask questions about any nonimmigrant visa and get non-legal advice from this bot.

**Importing the required libraries**

In [12]:
import numpy as np  #For numerical computation in python
import nltk         #For natural language processing
import string       #process strings in python
import random
import warnings
warnings.filterwarnings('ignore')

**Importing and reading the corpus**

In [13]:
with open('data_USCIS.txt','r',encoding='utf-8',errors='ignore') as f:  #Explicit UTF-8 avoids garbled apostrophes; the file is closed automatically
    raw_doc=f.read()
raw_doc=raw_doc.lower() #Converts text to lowercase
nltk.download('punkt') #Using the Punkt tokenizer. Other NLTK tokenizers include TweetTokenizer, RegexpTokenizer etc
nltk.download('wordnet') #Using the WordNet dictionary
sent_tokens = nltk.sent_tokenize(raw_doc) #Converts doc to list of sentences
word_tokens = nltk.word_tokenize(raw_doc) #Converts doc to list of words

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kdhar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kdhar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
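
The encoding used to read the corpus matters for everything downstream. As a minimal sketch, assuming the corpus file is stored as UTF-8 and is decoded with the Windows code page cp1252 instead, a single curly apostrophe turns into three unrelated characters. The sample string is illustrative and not taken from the corpus file.

```python
# Sketch: how a UTF-8 curly apostrophe is garbled when decoded as cp1252.
original = "agency\u2019s"                 # "agency's" written with a curly apostrophe

utf8_bytes = original.encode("utf-8")      # how the text is stored on disk (assumed UTF-8)
garbled = utf8_bytes.decode("cp1252")      # what a cp1252 reader produces: "agencyâ€™s"

# Reversing the wrong decode recovers the original text
repaired = garbled.encode("cp1252").decode("utf-8")

print(len(original), len(garbled))         # the garbled form is two characters longer
print(repaired == original)
```

Passing `encoding='utf-8'` to `open` avoids the problem at the source, which is usually preferable to repairing the text afterwards.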


**Sentence tokens**

In [14]:
sent_tokens[:10]  #Print the first ten sentences

['policy manual\nthe uscis policy manual is the agency’s centralized online repository for uscis’ immigration policies.',
 'the uscis policy manual will ultimately replace the adjudicator’s field manual (afm), the uscis immigration policy memoranda site, and other policy repositories.',
 'about the policy manual\nthe uscis policy manual is the agency’s centralized online repository for uscis’ immigration policies.',
 'the policy manual is replacing the adjudicator’s field manual (afm), the uscis immigration policy memoranda site, and other uscis policy repositories.',
 'the policy manual contains separate volumes pertaining to different areas of immigration benefits administered by the agency, such as citizenship and naturalization, adjustment of status, and nonimmigrants.',
 'the content is organized into different volumes, parts, and chapters.',
 'the policy manual provides transparency of immigration policies and furthers consistency, quality, and efficiency consistent w

In [15]:

print('Total sentences in document:', len(sent_tokens ))

Total sentences in document: 960


## Preprocessing

This is an information-retrieval chatbot. All policy data is used as a single text document. Because the source is a published policy document, there are no missing text entries. The bot works on unstructured data, so I plan to apply several normalization steps to the corpus.

The main difficulty with text data is that it arrives as unstructured strings, whereas machine learning algorithms need numerical feature vectors to perform a task. I will use NLTK, whose data package includes a pre-trained Punkt tokenizer for English. The key steps I plan to perform during the project are removing noise, removing stop words, stemming (including trying multiple stemming operations), lemmatization, named entity recognition (NER), and part-of-speech (POS) tagging.

I plan to use CountVectorizer from `sklearn.feature_extraction.text` to build a vocabulary of words from the corpus. After that I plan to compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors, which gives a matrix with one row per sentence. Finally, I will use cosine similarity to compute a numeric measure of how similar two of those vectors are, here the user's query and each sentence of the corpus.
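
To make the TF-IDF and cosine-similarity idea concrete, here is a minimal pure-Python sketch. It is not the scikit-learn implementation (which applies smoothing and normalization by default); it uses plain term frequency times `log(N / df)`, whitespace tokenization, and three made-up sentences, with the last one playing the role of the user query. The function names `tf_idf_vectors` and `cosine` are my own.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document (plain TF * IDF, no smoothing)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy sentences, made up for illustration (not from the USCIS corpus)
docs = [
    "student visa for vocational programs",
    "work authorization for dependents",
    "student visa",   # plays the role of the user query
]
vecs = tf_idf_vectors(docs)
scores = [cosine(vecs[-1], v) for v in vecs[:-1]]
print(scores)   # the first sentence shares terms with the query, the second does not
```

The sentence with the highest score is the one the bot would return. A score of zero means the query shares no weighted terms with that sentence.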

Preprocessing involves word tokenization, removing non-ASCII characters, removing tags of any kind, part-of-speech tagging, and lemmatization.
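
A few of these cleaning steps can be sketched with the standard library alone. This is a rough illustration, not the pipeline used later in the notebook: the stop-word list is a tiny hand-picked sample, the tokenizer is a regular expression rather than Punkt, and part-of-speech tagging and lemmatization are left out because they need NLTK models. The name `clean_text` is hypothetical.

```python
import re

# Small illustrative stop-word list; NLTK and scikit-learn ship much longer ones
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "for", "and", "in"}

def clean_text(text):
    """Strip tags and non-ASCII characters, lowercase, tokenize, drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)                   # remove HTML/XML-style tags
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII characters
    # Words may contain internal hyphens, so "m-1" stays one token
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

sample = "<p>The M-1 visa is for vocational students\u2026</p>"
print(clean_text(sample))   # ['m-1', 'visa', 'vocational', 'students']
```

Dropping non-ASCII characters is lossy: a word written with a curly apostrophe loses the apostrophe entirely, so it may be better to normalize such characters before this step.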

**Word tokens**

In [16]:
word_tokens[:10]  #Print first 10 words

['policy',
 'manual',
 'the',
 'uscis',
 'policy',
 'manual',
 'is',
 'the',
 'agency’s',
 'centralized']

**Text preprocessing**

In [17]:
lemmer = nltk.stem.WordNetLemmatizer()
#WordNet is a semantically-oriented dictionary of English included in NLTK library.
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
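
The punctuation-stripping part of `LemNormalize` can be checked on its own without NLTK. The sketch below rebuilds the same translation table and applies it to a made-up question. The sample sentence is an assumption for illustration only.

```python
import string

# Same construction as remove_punct_dict in the cell above
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

sample = "Can I get a student visa (F-1)?"
stripped = sample.lower().translate(remove_punct_dict)
print(stripped)   # can i get a student visa f1

# string.punctuation is ASCII-only, so a curly apostrophe is left in place
curly = "agency\u2019s".translate(remove_punct_dict)
print(curly == "agency\u2019s")   # True
```

One consequence is that hyphens are deleted rather than replaced by spaces, so a category such as "h-1b" is normalized to "h1b". That may be why a query typed as "H1B" can still match corpus sentences that spell it "h-1b".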

**Defining the greeting function**

In [18]:
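GREET_INPUTS = ("hello", "hi", "greetings", "sup", "what's up","hey")   #sup is Millennial shorthand for what's up?
GREET_RESPONSES = ["hi", "hey", "Hi, How are you?", "hi there", "hello", "I am glad! You are talking to me"]

def greet(sentence):
    #Returns a random greeting if any word of the input is a known greeting, otherwise None
    #Matching is word by word, so the two-word entry "what's up" and words with attached punctuation ("hi!") never match
    for word in sentence.split():
        if word.lower() in GREET_INPUTS:
            return random.choice(GREET_RESPONSES)
    return None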
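
Because `greet` compares whitespace-separated words, inputs such as "Hi!" or "what's up?" are not recognized. One possible variant, sketched below under the assumption that punctuation should be ignored and multi-word greetings should match as phrases, normalizes the input first. `greet_phrase` is a hypothetical name, and the response list is shortened for the example.

```python
import random
import re

GREET_INPUTS = ("hello", "hi", "greetings", "sup", "what's up", "hey")
GREET_RESPONSES = ["hi", "hey", "hi there", "hello"]   # shortened copy for this sketch

def greet_phrase(sentence):
    """Return a random greeting if the input contains a greeting word or phrase, else None."""
    # Keep letters, apostrophes and spaces so "Hi!" and "what's up?" normalize cleanly
    normalized = re.sub(r"[^a-z' ]+", " ", sentence.lower())
    # Pad with spaces so only whole words match ("this" must not match "hi")
    padded = " " + " ".join(normalized.split()) + " "
    for greeting in GREET_INPUTS:
        if " " + greeting + " " in padded:
            return random.choice(GREET_RESPONSES)
    return None

print(greet_phrase("Hi!") is not None)          # True
print(greet_phrase("what's up?") is not None)   # True
print(greet_phrase("this visa"))                # None
```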

**Response generation**

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer  #Term frequency and inverse document frequency (gives more weight to rare words)
from sklearn.metrics.pairwise import cosine_similarity       #Measures how similar two TF-IDF vectors are

In [20]:
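def response(user_response):
    #Expects user_response to already be the last element of sent_tokens (the main loop appends it)
    robo1_response=''
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)   #One TF-IDF row per sentence; the last row is the query
    vals = cosine_similarity(tfidf[-1], tfidf)    #Similarity of the query to every sentence
    idx=vals.argsort()[0][-2]                     #The top match is normally the query itself, so take the second best
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]                          #Similarity score of that second-best match
    if(req_tfidf==0):
        #No term overlap with any corpus sentence
        robo1_response=robo1_response+"I am sorry! I don't understand you"
        return robo1_response
    else:
        robo1_response = robo1_response+sent_tokens[idx]
        return robo1_response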
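
The selection logic in `response` can be illustrated without scikit-learn. The sketch below works on a plain list of similarity scores and assumes, as `response` does, that the last entry is the query compared with itself. `best_match` and its `min_score` parameter are hypothetical; the original function only checks for a score of exactly zero.

```python
def best_match(similarities, min_score=0.0):
    """Index of the most similar corpus sentence, or None if nothing exceeds min_score.

    Assumes the last entry is the query's similarity with itself.
    min_score is a hypothetical tuning knob, not part of the original function.
    """
    candidates = similarities[:-1]          # drop the query's self-similarity
    if not candidates:
        return None
    idx = max(range(len(candidates)), key=candidates.__getitem__)
    return idx if candidates[idx] > min_score else None

print(best_match([0.10, 0.45, 0.00, 1.00]))   # 1
print(best_match([0.00, 0.00, 0.00]))         # None
```

Raising `min_score` above zero would make the bot answer "I don't understand" for weak matches instead of returning a loosely related sentence.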

**Defining conversation start/end protocols**

In [21]:
flag=True
print("BOT: My name is Kiti. Let's have a conversation! Also, if you want to exit any time, just type Bye!")
while flag:
    user_response = input()
    user_response=user_response.lower()
    if(user_response!='bye'):
        if(user_response=='thanks' or user_response=='thank you' ):
            flag=False
            print("BOT: You are welcome..")
        else:
            greeting = greet(user_response)   #Call greet once so the check and the reply use the same result
            if greeting is not None:
                print("BOT: "+greeting)
            else:
                sent_tokens.append(user_response)   #response() expects the query as the last sentence
                word_tokens=word_tokens+nltk.word_tokenize(user_response)
                final_words=list(set(word_tokens))
                print("BOT: ",end="")
                print(response(user_response))
                sent_tokens.pop()   #Remove the query again; pop() cannot delete an identical corpus sentence by mistake
    else:
        flag=False
        print("BOT: Goodbye! Take care <3 ")  #<3 is for heart shape

BOT: My name is Kiti. Let's have a conversation! Also, if you want to exit any time, just type Bye!
hi
BOT: hello
Hello Kiti
BOT: hi
Can I get Student Visa
BOT: the nonimmigrant vocational student (m-1) visa category includes students in established vocational or other recognized nonacademic programs but excludes language training programs.
How about H1B visa
BOT: [4] in 2018, congress further extended the guam and cnmi h-2b and h-1b visa cap exemptions from 2019 to 2029.
Can you tell me about dependent visa
BOT: however, the dependents are not authorized to work in the united states while in the foreign information media representative dependent status.
can dependent get work authorization?
BOT: however, the dependents are not authorized to work in the united states while in the foreign information media representative dependent status.
bye
BOT: Goodbye! Take care <3 
