# Introduction to Natural Language Processing

Natural Language Processing (NLP) is a set of techniques used to analyse text data and to help machines learn from it. NLTK is one of the best-known libraries in the world of text analytics and NLP.

In this tutorial, we will look at tweets and use a simple method from scikit-learn to build a machine learning model.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

### Data Dictionary

1. text - It represents the text of the tweet

2. keyword - A particular word from that tweet (although this may be blank!)
About keywords: on Twitter, keyword targeting lets advertisers connect with users based on words and phrases they have recently tweeted or searched for, so a business can reach its target audience when it is most relevant to them.

3. location - The location of the tweet

4. target - The label we are predicting: whether a given tweet is about a real disaster or not. 1 represents a disaster tweet and 0 represents a non-disaster tweet.
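As a quick illustration of the data dictionary above, the sketch below builds a tiny DataFrame with the same column names and checks for blank keywords and locations and for the class balance. The four rows are invented for illustration only and are not taken from the competition data.

```python
import pandas as pd

# Hypothetical rows that mimic the columns described above (not real data).
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "keyword": [None, "flood", "fire", None],
    "location": [None, "Texas", None, "London"],
    "text": [
        "Forest fire near the highway, residents evacuated",
        "Flood warning issued for the whole county",
        "This new mixtape is fire",
        "What a lovely sunny day",
    ],
    "target": [1, 1, 0, 0],
})

# keyword and location may be blank, so count the missing values.
missing = sample[["keyword", "location"]].isna().sum()
print(missing)

# How many tweets fall into each class (1 = disaster, 0 = not a disaster).
class_counts = sample["target"].value_counts()
print(class_counts)
```

The same two checks can be run on the real `train` DataFrame once it is loaded.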

In [None]:
train = pd.read_csv("../input/nlp-getting-started/train.csv")
test = pd.read_csv("../input/nlp-getting-started/test.csv")

In [None]:
train.head()

In [None]:
print(train.shape, test.shape)

In [None]:
# Let's look at the tweets that are not disaster tweets
train.loc[train.target==0,"text"]

In [None]:
# Let's look at the disaster tweets (all but the first one)
train.loc[train.target==1,"text"].values[1:]

## Count Vectorizer
Whether a tweet is about a disaster or not depends on the words used in the tweet. Hence, we will now count the words in each tweet so that the counts can be fed into a machine learning model at a later stage.

Here, we will be using the CountVectorizer from scikit-learn to do the job.

Let's see an example of the CountVectorizer before we put it to use.

#### Explanation of Count Vectorizer
The CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

You can use it as follows:

1. Create an instance of the CountVectorizer class.
2. Call the fit() function in order to learn a vocabulary from one or more documents.
3. Call the transform() function on one or more documents as needed to encode each as a vector.

An encoded vector is returned with a length of the entire vocabulary and an integer count for the number of times each word appeared in the document.

Information Source: https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/
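To make the three steps above concrete, here is a minimal pure-Python sketch of the same idea. It is not scikit-learn's implementation: the helper names `tokenize`, `fit_vocabulary`, and `transform` and the sample sentences are made up for illustration, and the regular expression only approximates the default tokenizer (lower-cased tokens of two or more word characters).

```python
import re
from collections import Counter

def tokenize(doc):
    # Lower-case the text and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

def fit_vocabulary(docs):
    # Map every distinct token to a column index, in alphabetical order.
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}

def transform(docs, vocab):
    # Encode each document as a list of counts, one slot per vocabulary term.
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocab)
        for term, n in counts.items():
            if term in vocab:  # words outside the vocabulary are ignored
                row[vocab[term]] = n
        rows.append(row)
    return rows

docs = ["The storm hit the coast", "Storm warning issued"]
vocab = fit_vocabulary(docs)
vectors = transform(docs, vocab)
print(vocab)
print(vectors)
```

Each vector has one entry per vocabulary word, and "the" is counted twice in the first document.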


In [None]:
# Importing Count Vectorizer...
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()

In [None]:
# Example of Count Vectorizer
text = ["Jack and Jill went up the Hill!"]
count_vectorizer.fit(text) # fit function helps learn a vocabulary about the text
transformed = count_vectorizer.transform(text) # encodes the text/doc into a vector

The vectors returned from a call to transform() are sparse, and you can convert them to NumPy arrays by calling the toarray() function to look at them and better understand what is going on.
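The following sketch shows why the result is stored sparsely: only the non-zero counts are kept, while the dense array also stores every zero. The three sentences are invented for illustration.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, used only to illustrate sparse vs dense storage.
docs = [
    "fire in the forest",
    "forest fire spreading fast",
    "sunny day at the beach",
]

X = CountVectorizer().fit_transform(docs)

print(issparse(X))   # the result is a SciPy sparse matrix
print(X.shape)       # (number of documents, vocabulary size)
print(X.nnz)         # number of non-zero entries actually stored

dense = X.toarray()  # full NumPy array, zeros included
print(dense.size)    # total number of cells in the dense array
```

On a real corpus the vocabulary has thousands of columns while each tweet uses only a handful of words, so the gap between `X.nnz` and `dense.size` is much larger.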

In [None]:
print(transformed.shape)
print(type(transformed))
print(transformed.toarray())

To understand what has been transformed, we can inspect the vocabulary_ attribute, which maps each word to its column index.

In [None]:
print(count_vectorizer.vocabulary_)

# Observations: Punctuation is ignored and all words are converted to lower case
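One way to see this preprocessing directly is to ask the vectorizer for its analyzer, the function it applies to each document before counting. This is a small sketch using default settings; the second sentence is a made-up example showing that, with the defaults, single-character tokens are dropped as well.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the preprocessing + tokenizing function
# that the vectorizer applies to every document.
analyzer = CountVectorizer().build_analyzer()

tokens = analyzer("Jack and Jill went up the Hill!")
print(tokens)

# With default settings, one-character tokens such as "I" and "a" are dropped.
short_tokens = analyzer("I am a fan")
print(short_tokens)
```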

In [None]:
# Test with another word
print(count_vectorizer.transform(["Jack"]).toarray()) # able to recognize the word in upper case | Location is 2
print(count_vectorizer.transform(["and"]).toarray()) # Loc is 0 as per above vocabulary
print(count_vectorizer.transform(["Jill"]).toarray())
print(count_vectorizer.transform(["Mukul Singh"]).toarray()) # No words found and hence all 0

In [None]:
# Let's get the word counts for the first 5 tweets
example = count_vectorizer.fit_transform(train["text"][0:5])

print(example[0].todense().shape)
print(example[0].todense())

1. This means that there are 54 tokens/unique words in the first 5 tweets
2. The first vector has been printed for the first tweet.
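A row of counts is easier to read once each column is mapped back to its word. The sketch below does that for two invented sentences (they stand in for the tweets), ordering the vocabulary by column index via the `vocabulary_` mapping.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences standing in for the first few tweets.
docs = ["Flood waters rising fast", "Evacuation ordered after flood"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
print(counts.shape)  # (documents, unique words)

# Words ordered by their column index in the count matrix.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Pair each word with its count in the first document, keeping non-zero ones.
first_row = counts[0].toarray().ravel()
first_row_words = {w: int(c) for w, c in zip(words, first_row) if c > 0}
print(first_row_words)
```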

In [None]:
print(list(count_vectorizer.vocabulary_))
print("Unique Words are: ")
print(np.unique(list(count_vectorizer.vocabulary_)))

Let's create vectors for all the tweets.

In [None]:
# Train Set
alltweets = count_vectorizer.fit_transform(train["text"]) # Transformed the Train Tweets

The test set must be encoded with the vocabulary learned from the train set, so that its columns line up with the ones the model is trained on. Hence, we will not use fit_transform here; we will use transform only.
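The effect of calling transform without fit can be seen on a toy example. In this sketch (the sentences are invented), the word "tornado" appears only in the test sentence, so it has no column and is silently ignored, while the number of columns stays the same for both sets.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical train and test sentences.
train_docs = ["earthquake shakes the city", "lovely day in the city"]
test_docs = ["tornado hits the city"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary
X_test = vectorizer.transform(test_docs)        # reuses it, learns nothing new

print(X_train.shape, X_test.shape)              # same number of columns
print("tornado" in vectorizer.vocabulary_)      # unseen word has no column
print(X_test.toarray())                         # only "city" and "the" are counted
```

Calling fit_transform on the test set instead would build a different vocabulary, and the columns would no longer match what the model saw during training.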

In [None]:
# Test Set
testtweets = count_vectorizer.transform(test["text"])

### Modelling

Now we will start with the model. The target variable takes the values 0 and 1, so this is a classification problem. We will first evaluate a RidgeClassifier on its own and then combine it with a Random Forest and a Gradient Boosting model in a simple voting ensemble.
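With its default setting (`voting="hard"`), scikit-learn's VotingClassifier predicts the label that most of its member models agree on. The sketch below illustrates that majority rule with NumPy only; the prediction values are invented and do not come from the models in this notebook.

```python
import numpy as np

# Hypothetical 0/1 predictions from three models for four tweets:
# one row per model, one column per tweet.
preds = np.array([
    [1, 0, 1, 1],  # e.g. random forest
    [1, 0, 0, 1],  # e.g. ridge classifier
    [0, 0, 1, 1],  # e.g. gradient boosting
])

# Hard voting: a tweet is labelled 1 when at least 2 of the 3 models say 1.
votes_for_disaster = preds.sum(axis=0)
majority = (votes_for_disaster >= 2).astype(int)
print(majority)
```

Hard voting only needs predicted labels, which suits RidgeClassifier since it does not provide class probabilities.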

In [None]:
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score
ridge = RidgeClassifier()

In [None]:
print(cross_val_score(ridge, alltweets, train.target, cv = 5, scoring ="f1").mean())

In [None]:
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, GradientBoostingClassifier
gbm = GradientBoostingClassifier()
rf = RandomForestClassifier()
vc = VotingClassifier(estimators = [("rf", rf), ("ridge", ridge), ("GBM", gbm)])

In [None]:
vc.fit(alltweets, train.target)

In [None]:
solution = pd.DataFrame({"id": test.id, "target": vc.predict(testtweets)})
solution.to_csv("VC Model.csv", index=False) # Kaggle: 0.78016

The submission gets an F1 score of roughly 0.78 on Kaggle, which is a good start for the modelling. Having said that, we can try different features such as TF-IDF, or different models such as a neural network or an LSTM, to improve the score.
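For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it by hand on a handful of invented labels, so the numbers are purely illustrative and unrelated to the score above.

```python
# Hypothetical true labels and predictions (1 = disaster, 0 = not a disaster).
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged disasters
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed disasters

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```

Because F1 ignores true negatives, it rewards a model for finding disaster tweets without raising too many false alarms.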

This work is inspired by https://www.kaggle.com/philculliton/nlp-getting-started-tutorial