## NLP Tutorial

NLP, short for *Natural Language Processing*, covers a wide array of techniques designed to help machines learn from text. It powers everything from chatbots to search engines, and is used in tasks as diverse as sentiment analysis and machine translation.

In this tutorial we'll look at this competition's dataset, use a simple technique to process it, build a machine learning model, and submit predictions for a score!

In [1]:
###IMPORTING LIBRARIES###

import numpy as np
import pandas as pd
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

In [2]:
###READING IN THE DATA###

train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

In [3]:
###QUICK LOOK AT EXAMPLES FROM THE DATA###

print("This is not a disaster tweet:")
exNot = train_df[train_df["target"] == 0]["text"].values[1]
print(exNot)

print("\nThis is a disaster tweet:")
exIs = train_df[train_df["target"] == 1]["text"].values[1]
print(exIs)

This is not a disaster tweet:
I love fruits

This is a disaster tweet:
Forest fire near La Ronge Sask. Canada


In [4]:
###SIMPLISTIC ANALYSIS USING THE FIRST 5 TWEETS###

#Initialize count_vectorizer
count_vectorizer = feature_extraction.text.CountVectorizer()

#First 5 tweets
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])

#We use .todense() here because these vectors are "sparse" (only non-zero elements are kept to save space)
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())

(1, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]
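To make the count vector above less opaque, here is a minimal pure-Python sketch of the bag-of-words idea. It is not scikit-learn's implementation; it only imitates the default behaviour of `CountVectorizer` (lowercasing, tokens of two or more word characters, alphabetically ordered vocabulary). The helper names `tokenize`, `fit_vocabulary` and `count_vector`, and the toy tweets, are assumptions made for illustration.

```python
import re

# Simplified stand-in for the default tokenization (an assumption for illustration):
# lowercase the text and keep tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(texts):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def count_vector(text, vocabulary):
    """Return one count per vocabulary token; unknown tokens are skipped."""
    vector = [0] * len(vocabulary)
    for tok in tokenize(text):
        if tok in vocabulary:
            vector[vocabulary[tok]] += 1
    return vector

# Hypothetical toy data, not taken from the competition files
toy_tweets = ["Forest fire near the forest", "I love fruits"]
vocabulary = fit_vocabulary(toy_tweets)
vectors = [count_vector(t, vocabulary) for t in toy_tweets]

print(vocabulary)
print(vectors)
```

Each column corresponds to one token in the vocabulary, so the 54 columns in the output above mean that the first five tweets contain 54 distinct tokens. Most entries in any single row are zero, which is why the real vectors are stored in sparse form.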


In [5]:
###INITIALIZE VECTORS FOR ALL TWEETS###

#Fit the vectorizer on all training tweets and build the training vectors
train_vectors = count_vectorizer.fit_transform(train_df["text"])

#Ensure that the train and test vectors use the same set of tokens with .transform()
test_vectors = count_vectorizer.transform(test_df["text"])
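The difference between fitting and transforming can be sketched in plain Python: the vocabulary is fixed when it is learned from the training text, and tokens that appear only in new text have no column to land in. The vocabulary, the `transform` helper and the example sentence below are hypothetical and only illustrate the idea; they are not scikit-learn's code.

```python
import re

def tokenize(text):
    # Simplified tokenizer (assumption): lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Hypothetical vocabulary, as if it had been learned from training text only
train_vocabulary = {"fire": 0, "forest": 1, "near": 2}

def transform(text, vocabulary):
    """Count known tokens; collect tokens the vocabulary has never seen."""
    vector = [0] * len(vocabulary)
    ignored = []
    for tok in tokenize(text):
        if tok in vocabulary:
            vector[vocabulary[tok]] += 1
        else:
            ignored.append(tok)
    return vector, ignored

# A made-up "test" tweet containing two tokens absent from the vocabulary
test_vector, ignored_tokens = transform("Wild fire near town", train_vocabulary)

print(test_vector)     # same length as the training vectors
print(ignored_tokens)  # tokens that are silently dropped
```

Because the test vector has the same columns as the training vectors, a model trained on one can be applied to the other. Calling `fit_transform` on the test text instead would build a different vocabulary and break that alignment.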

In [6]:
###INITIALIZE MODEL AND CROSS-VALIDATE SCORE###

#Ridge classifier (a linear model): score it with 3-fold cross-validated F1, then fit it on all training data
ridgeC = linear_model.RidgeClassifier()
scores = model_selection.cross_val_score(ridgeC, train_vectors, train_df["target"], cv=3, scoring="f1")
print(scores)
ridgeC.fit(train_vectors, train_df["target"])

[0.59485531 0.56498283 0.64149093]


RidgeClassifier()
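The three numbers printed above are F1 scores, one per cross-validation fold. As a rough illustration of what that metric measures, the sketch below computes F1 for the positive class by hand. The function name `f1_score_binary` and the label lists are made up for this example and are not results from the competition data.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: treat the score as 0 to avoid dividing by zero
        return 0.0
    precision = tp / (tp + fp)  # share of predicted disasters that were real
    recall = tp / (tp + fn)     # share of real disasters that were found
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 1 = disaster tweet, 0 = not a disaster tweet
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

score = f1_score_binary(y_true, y_pred)
print(round(score, 4))
```

In this toy case there are two true positives, one false positive and one false negative, so precision and recall are both 2/3 and so is F1. A score near 0.6, as in the folds above, indicates a model that is doing noticeably better than nothing but still mislabels many tweets.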

In [7]:
###PREPARE THE SUBMISSION###

sample_submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")
sample_submission["target"] = ridgeC.predict(test_vectors)
sample_submission.to_csv("submission.csv", index=False)
#Preview the first rows (a bare expression is displayed only when it is the last line of a cell)
sample_submission.head()